anders.com: words: apache

article for maximum linux magazine

(note, this is a proof. the actual article as printed was a revised version of this proof. you can order back issues of maximum linux from https://secure.imaginemedia.com/)

MaximumLinux Article: Apache!
By Anders Brownworth

1. intro

What's the Internet worth without the World Wide Web? So what's an operating system worth without a webserver? Want to set up your own website? Thought so!

In the world of webservers, there is Apache, and then there is everything else. Apache is easily the world's most popular webserver, and for good reason. Apache, like all good software, is open source, so it has benefited from thousands of eyeballs scrutinizing the source code for a long time. Honestly, it's not the cleanest code in the world, but so many people have hammered on it that it is rock solid!

In this article, I'll take you through the process of compiling and installing the latest version of apache. There are quite a few specialized things you need to know to build and run an Apache webserver, especially if you want to do it right.

Any website involves bandwidth considerations. Remember, most consumer Internet connections are fast for downloads, but slow for uploads. If you are running a webserver, the overwhelming majority of traffic will be "upload" traffic, so you should think twice about putting a big website on your DSL or Cable Modem connection. Both of these technologies (with the exception of SDSL) have a high download bitrate but a slow upload bitrate. The same concept holds true for 56K modems, although if your Internet connectivity is through a 56K modem, your website is going to be REALLY slow for the rest of the world! But I digress.

Virtually all Linux distributions come with Apache in some form or another. You probably have some sort of default Apache installation on your system right now, but are you sure you have the latest version of the code? And do you really know what options were chosen for you? Hang on because we are going to do an Apache install the way God intended, from the source!

2. the basics: getting a server set up

The latest version of the source code is always available from the Apache Group by pointing your FTP client at ftp.apache.org. (Use a local mirror.) As of this writing, the latest version of Apache is 1.3.9, so we will assume this version in this article. Download a copy of the source code (apache_1.3.9.tar.gz) and save it in a nice convenient location, such as /usr/local/src. It's probably a good idea to keep the source for all of your installed software here so you can easily find it and check version numbers.

Now we need to uncompress and untar the package. (tar -xzvf apache_1.3.9.tar.gz) You could uncompress the file first with gzip -d apache_1.3.9.tar.gz, but the newer versions of tar handle gzip compression and decompression on the fly with the "-z" flag.

Once you have the source ready to go, we need to get it ready to compile. Go into the apache_1.3.9 source directory (cd apache_1.3.9) and start the configuration script. (./configure) This script will rip through your computer and figure out a bunch of critical settings such as what compiler you are using and what libraries you have installed. It will probably complain about the lack of configuration options, but don't worry about that unless you want to install Apache in a non-standard location.

Next, we have to compile the package. To compile everything in the source tree, type "make" in the source directory. (Unless you told the configuration script otherwise, everything will later be installed under /usr/local/apache.)

To install the package, you will need to run a "make install" as root. If you aren't already root, you can use the command "su" to gain root privileges for this step. The distribution will end up in the default location (/usr/local/apache) or wherever you configured it to go.
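The download-to-install steps above can be collected into a small script. This is only a sketch under the article's assumptions: the tarball apache_1.3.9.tar.gz is already sitting in /usr/local/src and the default location /usr/local/apache is what you want. The variable and function names are made up for illustration, and the --prefix option is only spelled out to show where a non-standard location would go.

```shell
#!/bin/sh
# Sketch of the build steps described above. The variable and function
# names are illustrative; they are not part of the Apache distribution.
APACHE_VERSION="1.3.9"
SRC_DIR="/usr/local/src"
TARBALL="apache_${APACHE_VERSION}.tar.gz"
BUILD_DIR="${SRC_DIR}/apache_${APACHE_VERSION}"
PREFIX="/usr/local/apache"   # the default install location

# Unpack, configure and compile. This part does not need root, as long
# as you can write to SRC_DIR.
build_apache() {
    cd "$SRC_DIR" || return 1
    tar -xzvf "$TARBALL" || return 1
    cd "$BUILD_DIR" || return 1
    ./configure --prefix="$PREFIX" || return 1
    make
}

# Copy the compiled files into PREFIX. This part must be run as root.
install_apache() {
    cd "$BUILD_DIR" || return 1
    make install
}
```

Nothing happens until you call the functions: run build_apache first, then become root and run install_apache.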

There are three files that just got installed that you should know about. The most important command you will use is "apachectl" or Apache control. This little program takes arguments like "start" and "stop" and essentially controls the server. The actual webserver is called "httpd". This program is started and stopped by apachectl. (although you could do it by hand if you wanted to) httpd needs at least one configuration file, called "httpd.conf" that you will find in "/usr/local/apache/conf/httpd.conf".

To get things running, we will need to go over the major configuration options in the httpd.conf file. We'll skip most of the more obscure options and set only the critical ones. Most of the important items are at the top of the httpd.conf file.

In your installation directory, you should have a configuration directory called "conf". (/usr/local/apache/conf) Everything used to configure the webserver is in this directory. The main configuration file for Apache is called "httpd.conf". Open this file up in an editor. (you will need to be root as the install process writes these files as root) We are going to leave most of the options at their default. Assuming a standard installation (installed in the directory /usr/local/apache) you really won't have to change much to get a working server.

ServerRoot "/usr/local/apache"

If you didn't install everything in the default location, tell apache where things are with the ServerRoot option.

User nobody
Group nogroup

Just in case you leave a door open, or someone figures out how to trick the webserver into doing something malicious, the server runs as a user with very few permissions. If someone were to break in and get the server to try to rewrite your password file, the attack would fail, because the unprivileged user "nobody" can't overwrite that file. Most major Linux distributions have a user called "nobody" and a group called "nogroup", but make sure both exist on your system.

DocumentRoot "/usr/local/apache/htdocs"

The document root is the location of all your html documents. The default location has a sample web page so this will probably do for the time being.

ServerAdmin webmaster@maximumlinux.com

The email address you set with ServerAdmin is given as contact information on the error pages the server generates, such as when a CGI doesn't respond correctly. This is usually set to a general delivery mailbox such as webmaster@yourdomain.com.

ServerName www.maximumlinux.com

You can't just invent names! It's a good idea for the server to know what its name is in the event that it has to doctor up URLs, but just typing www.maximumlinux.com as the ServerName does not make you maximumlinux.com! Set this to a domain name that will resolve to your server. If you would like your own special name, you will need to purchase a domain name (about $35 a year) and establish nameservice on the Internet. There are a bunch of companies out there that will do this for you.

Now for the moment of truth. Crank up the webserver with the command:

/usr/local/apache/bin/apachectl start

If all goes well, the server should now be started. You can check that it is running by running:

ps -axf |grep httpd

You should see an httpd daemon launched with 5 children hanging off of it.

 6106 ?        S      0:00 /usr/local/apache/bin/httpd
 6107 ?        S      0:00  \_ /usr/local/apache/bin/httpd
 6108 ?        S      0:00  \_ /usr/local/apache/bin/httpd
 6109 ?        S      0:00  \_ /usr/local/apache/bin/httpd
 6110 ?        S      0:00  \_ /usr/local/apache/bin/httpd
 6111 ?        S      0:00  \_ /usr/local/apache/bin/httpd

The main process doesn't actually deal with web connections; it just makes sure that the proper number of child servers are hanging around. Child processes wait around to handle web connections, rather than the main process dealing with everything, because spawning new processes takes time. So for the quickest server response time, Apache keeps a bunch of spare servers around to deal with requests as they come in. The root process will spawn new servers automatically to keep ahead of the load. You can change the minimum and maximum number of spare servers with the MinSpareServers and MaxSpareServers directives in the httpd.conf file.
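If you would rather have a number than eyeball the process list, something like the following might do. It is a rough sketch: the function name is made up, and it simply counts lines mentioning httpd in whatever ps output you feed it.

```shell
#!/bin/sh
# Count httpd processes in "ps" output read from standard input.
# The bracket in the pattern keeps a grep command line that appears in
# the ps listing from being counted as an httpd process.
count_httpd() {
    grep -c '[h]ttpd'
}
```

Running "ps -axf | count_httpd" against the listing shown earlier would print 6: the main process plus its five children.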

Now go to a web browser and try hitting your server. Ideally you want to test access from a computer other than your server. If everything went well, you should get the default web page saying "It worked!" (You have also just put a copy of the Apache documentation online, as that comes with the default install as well.)

3. fooling with some of the options

user directories

There are about 100,000 different things you can do with Apache. I'll touch on some of the more important ones. The httpd.conf file has many configurable options, but be sure you know what you are setting when you are playing with this file because it isn't very hard to make your system vulnerable to attacks if you don't do the right thing.

UserDir
Once you have a webserver going, it's kind of nice to give your users a way to publish on the web without having to bother you for write access to the main document root. You have probably seen web addresses with a tilde character (~) in them. By default, when apache looks for http://www.maximumlinux.com/~anders/, it goes to the home directory for the user "anders" and looks for a directory called "public_html". (this is changed with the UserDir directive in the httpd.conf file) Generally I like to change this to "www" because it's easier to type, but set it as you like.
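To make the mapping concrete, here is a small sketch of how a tilde URL turns into a path on disk. It is an illustration only: the function name is made up, and it assumes every home directory lives under /home, whereas Apache looks up the user's real home directory.

```shell
#!/bin/sh
# Illustration only: translate a "/~user/..." URL path into the file
# Apache would look for. Assumes home directories live under /home.
# Usage: userdir_path URL_PATH [USERDIR]   (USERDIR defaults to public_html)
userdir_path() {
    url_path="$1"
    user_dir="${2:-public_html}"
    case "$url_path" in
        "/~"*) ;;              # only tilde URLs are handled here
        *) return 1 ;;
    esac
    rest="${url_path#"/~"}"    # e.g. anders/pics/a.gif
    user="${rest%%/*}"         # e.g. anders
    case "$rest" in
        */*) tail="${rest#*/}" ;;   # e.g. pics/a.gif
        *)   tail="" ;;
    esac
    printf '/home/%s/%s/%s\n' "$user" "$user_dir" "$tail"
}
```

With the default setting, "userdir_path /~anders/" prints /home/anders/public_html/, and passing "www" as the second argument shows what changing the UserDir directive would do.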

DirectoryIndex
When a user types in www.maximumlinux.com, the webserver must know what html page to send out because the user didn't ask for one specifically. Traditionally, the default page is called "index.html", but varying flavors such as "index.htm" and "default.htm" have been pushed on us by an un-named monopolistic company. If your users are using FrontPage or other WYSIWYG HTML editors, you may want to include other default documents in the DirectoryIndex directive. Just separate them with a space.

DirectoryIndex index.html index.htm default.htm
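The order of the names matters: the server hands back the first one it finds. The following sketch mimics that lookup so you can check which file would win in a given directory. The function name is made up, and this is a stand-in for illustration, not how Apache is implemented.

```shell
#!/bin/sh
# Illustration only: mimic how a DirectoryIndex list is searched.
# Prints the first candidate that exists in the given directory.
# Usage: pick_index DIRECTORY CANDIDATE...
pick_index() {
    dir="$1"
    shift
    for name in "$@"; do
        if [ -f "$dir/$name" ]; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    return 1   # none of the candidates exist
}
```

For example, "pick_index /usr/local/apache/htdocs index.html index.htm default.htm" prints whichever of the three names shows up first in that directory.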

AccessFileName
You know when you get that username/password box on some websites? Access restriction can be set up on a per file or per directory basis with an access file. By default this file is called .htaccess, and can contain many of the same parameters found in the httpd.conf file.

As an example, we can block access to a directory by setting the AllowOverride All directive for it in httpd.conf and placing a .htaccess file in the directory that you want to block. Make your .htaccess file look like this:

AuthName "Access"
AuthType Basic
AuthUserFile /usr/local/apache/conf/htpassword
AuthGroupFile /usr/local/apache/conf/htgroup

<limit GET>
require group users
</limit>

Essentially the first part of this file tells what type of authentication to use, and where the password and group files are. The second part limits access to the directory to people who have given a good password and are in the group "users".

Let's say that the username we want to use is "john" and the password we want is "bozo". The /usr/local/apache/conf/htgroup file listed below shows us that both "fred" and "john" are in the group "users".

users: fred john

The last thing we need to do is create an htpassword file with john's username and encrypted password in it. Fortunately for us, there is a little utility distributed with Apache that creates and manipulates the htpassword file.

/usr/local/apache/bin/htpasswd -c /usr/local/apache/conf/htpassword john

The command htpasswd -c creates a password file with john as the username and prompts for the password. Remember that if you want to change john's password, or add any new users, leave out the -c (htpasswd /usr/local/apache/conf/htpassword john) or you will over-write your password file! When you have typed and confirmed the password, you should be left with a file that looks somewhat similar to this:

john:3LttqYHM.WA96

Now you should have some simple authentication working, so give it a try and see if you can login with the new web account. (remember, in this case, web users are not system users. In other words, they are not in the system password file (/etc/passwd) and can not open a shell on your machine.)
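It can help to see what "AuthType Basic" actually puts on the wire. The browser joins the username and password with a colon, base64-encodes the result, and sends it along in an Authorization header with every request to the protected directory. The sketch below rebuilds that header; the function name is made up, and it assumes the base64 utility from GNU coreutils is installed.

```shell
#!/bin/sh
# Build the Authorization header a browser sends for Basic authentication.
# Assumes the "base64" utility (GNU coreutils) is available.
# Usage: basic_auth_header USERNAME PASSWORD
basic_auth_header() {
    printf 'Authorization: Basic %s\n' "$(printf '%s:%s' "$1" "$2" | base64)"
}
```

Keep in mind that base64 is an encoding, not encryption: anyone who can watch the connection can decode the password, so Basic authentication on its own is only a light lock.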

ScriptAlias
There are two ways to get CGI programs running. The old way and the new way! Initially, you would just designate a directory (/cgi-bin) as your CGI directory and everything in there was treated as a CGI program. More recently, CGI scripts have been specified by extension. For instance, you might specify every file ending in ".cgi" to be a CGI program. ScriptAlias will let you set up a CGI directory, so as an example, we could do this:

ScriptAlias /cgi-bin/ /usr/local/apache/htdocs/cgi-bin/

If we had a perl program called date.pl that printed today's date, we could stick it in this directory and access it in a web page with /cgi-bin/date.pl.

Alternatively, we could rename it date.cgi and stick it anywhere if we add "ExecCGI" to the Options line in the configuration section for the document root directory and add the cgi-script handler to treat .cgi files as scripts.

<Directory "/usr/local/apache/htdocs">
Options Indexes FollowSymLinks ExecCGI
AllowOverride None
Order allow,deny
Allow from all
</Directory>

AddHandler cgi-script .cgi

When writing CGI scripts, don't forget to send a proper HTTP header (a content-type line followed by a blank line) before any output. The following is the source code to the date.cgi script.

#!/usr/bin/perl

print "content-type: text/html\n\n";
print scalar localtime;
print "\n";
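CGI programs don't have to be perl. As a rough sketch, here is the same idea as a shell script, assuming /bin/sh and the date command are available on the server (the function name is made up). The rule is the same in any language: the content-type line and the blank line after it must come out before anything else.

```shell
#!/bin/sh
# Shell take on the date script. The Content-type header and the blank
# line after it must be printed before any page content.
print_date_page() {
    printf 'Content-type: text/html\n\n'
    date
}

print_date_page
```

Running the script by hand from a shell is a quick way to check that the header and the blank line come out first.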

note: if you want, you can set index.cgi as a DirectoryIndex so you can use a cgi script for the default pages of your website!

Redirect
Things change. There will come a time when you want to redirect requests to another place. It would be easy enough to write a web page that says "this page moved" or even one that does a meta refresh to automatically redirect browsers to another place, but in many situations, you just want every request to a specific directory to be universally redirected to another place. This is done with the Redirect directive.

Redirect /this_directory http://www.maximumlinux.com/

In this example, every URL beginning with "/this_directory/" is redirected to http://www.maximumlinux.com/.

ip aliasing and virtual servers

VirtualHost / NameVirtualHost
You can run more than one website from the same machine. Essentially there are two ways to do this. You can set up aliased IP addresses on your server and attach a webserver to each address, or you can use just one IP address and depend on the web browser to tell the server what website it wants. In either case, proper DNS setup is critical.

If you want to use aliased IP addresses, you would set up an aliased IP address for each server you want and declare a virtual host for each in the httpd.conf file. In the example below, there are two virtual servers. (www1.maximumlinux.com and www2.maximumlinux.com)

<VirtualHost www1.maximumlinux.com>
ServerAdmin webmaster@maximumlinux.com
DocumentRoot /sites/www1.maximumlinux.com/web
ErrorLog /sites/www1.maximumlinux.com/logs/error_log
TransferLog /sites/www1.maximumlinux.com/logs/access_log
DirectoryIndex index.html index.shtml index.cgi
</VirtualHost>

<VirtualHost www2.maximumlinux.com>
ServerAdmin webmaster@maximumlinux.com
DocumentRoot /sites/www2.maximumlinux.com/web
ErrorLog /sites/www2.maximumlinux.com/logs/error_log
TransferLog /sites/www2.maximumlinux.com/logs/access_log
DirectoryIndex index.html index.shtml index.cgi
</VirtualHost>

If you wanted to attach both of these servers to the same IP address and depend on the web browser to let the server know what site it's looking for, you would use the NameVirtualHost directive and add a ServerName directive to each virtual host. In this case, both www1.maximumlinux.com and www2.maximumlinux.com look up as the IP address 206.57.18.204.

NameVirtualHost 206.57.18.204

<VirtualHost www1.maximumlinux.com>
ServerName www1.maximumlinux.com
ServerAdmin webmaster@maximumlinux.com
DocumentRoot /sites/www1.maximumlinux.com/web
ErrorLog /sites/www1.maximumlinux.com/logs/error_log
TransferLog /sites/www1.maximumlinux.com/logs/access_log
DirectoryIndex index.html index.shtml index.cgi
</VirtualHost>

<VirtualHost www2.maximumlinux.com>
ServerName www2.maximumlinux.com
ServerAdmin webmaster@maximumlinux.com
DocumentRoot /sites/www2.maximumlinux.com/web
ErrorLog /sites/www2.maximumlinux.com/logs/error_log
TransferLog /sites/www2.maximumlinux.com/logs/access_log
DirectoryIndex index.html index.shtml index.cgi
</VirtualHost>

Using NameVirtualHost you can add a virtually unlimited number of websites to the same IP address rather than burning one IP address per website. The only possible downside to the NameVirtualHost approach is that very old web browsers don't send the name of the website they are looking for to the server in the headers, so in these cases, those browsers would see the default (or first) website. Generally though, this is a small price to pay for the flexibility of a NameVirtualHost server.
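The header in question is called Host. As a sketch, this is roughly the request a browser sends when it wants one particular site on a shared IP address. The function name is made up; the request format itself is plain HTTP.

```shell
#!/bin/sh
# Print an HTTP/1.0 request that names the wanted website in a Host
# header, which is how the server tells name-based virtual hosts apart.
# Usage: build_request HOSTNAME [PATH]   (PATH defaults to /)
build_request() {
    printf 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' "${2:-/}" "$1"
}
```

If you pipe the output into a TCP client pointed at port 80 of 206.57.18.204 (for instance netcat, assuming you have it installed; it is not part of Apache), asking for www1.maximumlinux.com and then www2.maximumlinux.com should bring back two different sites from the same address.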

Logs:

One of the biggest tricks a webmaster will face is getting some meaningful information from the log files that Apache is constantly spewing. Logfiles can be rotated (stop Apache from writing to one access log file and start a new file) on some sort of a basis. For the majority of webservers, rotation on a daily basis is probably going to be perfect. But you can't just move the logfile to another filename and hope the server makes a new file and starts sticking logs in that. You would need to kill and restart the server for that to work.
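Done by hand, the move-then-restart approach just described might look something like this. It is a sketch only: the function name and the date suffix are my own choices, and it assumes apachectl lives in the default /usr/local/apache/bin.

```shell
#!/bin/sh
# Path to apachectl; set APACHECTL beforehand to point somewhere else.
APACHECTL="${APACHECTL:-/usr/local/apache/bin/apachectl}"

# Rename LOGFILE with today's date appended, then restart Apache so it
# opens a fresh file. Usage (as root): rotate_log /path/to/access_log
rotate_log() {
    log_file="$1"
    stamp=$(date +%Y%m%d)
    [ -f "$log_file" ] || return 1
    mv "$log_file" "$log_file.$stamp" || return 1
    "$APACHECTL" restart
}
```

The restart is the part that hurts, since the server stops answering for a moment every time you rotate.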

But we are in luck here. Since this is quite a universal problem, there is a log rotation script that comes with Apache.

Now that we have the logs rotated on a daily basis, what do we do with them? You are probably going to want to take a look at them to see what has been happening with your website. The easiest thing to do would be to just open that file up and look at it to see what people are asking for, but a much more practical option would be to run the logfiles through some sort of log analysis software, so that is what we will do.
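Before reaching for a full analysis package, a few standard Unix tools will answer the most basic question: which pages are people asking for? This sketch assumes the access log is in the Common Log Format, where the requested path is the seventh whitespace-separated field. The function name is made up.

```shell
#!/bin/sh
# Read access log lines on standard input and print the ten most
# requested paths, most popular first, each with its hit count.
# Assumes the Common Log Format (path is the seventh field).
top_paths() {
    awk '{ print $7 }' | sort | uniq -c | sort -rn | head -n 10
}
```

Feed it a log with something like "top_paths < /usr/local/apache/logs/access_log", adjusting the path if your logs live somewhere else.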

There are many free packages out there. ...

4. getting fancy (dynamic pages)
php
perl <- teaser
cookies

Dynamic pages are pages that are generated on the fly from a script or program. There are a whole host of ways to make a site dynamic. One of the most popular CGI scripting languages is perl. With its origins in Unix, perl is uniquely suited to the needs of CGIs because of its flexibility, but its principal drawback is its speed. It takes a relatively long time to launch and execute a perl script compared with a compiled C program, for instance. But most people don't require absolute blinding speed, and perl's flexibility far outweighs this limitation.

The vast majority of websites out there are static, or built with flat HTML files. The only parts that can be thought of as dynamic are things like feedback forms that run a simple CGI that collects results and possibly emails them to some address. These forms are simple enough, but as demonstrated before with the date script, perl can be used for a whole host of other functions. For instance, you can have a perl script rip through a local text file and build some html on the fly. This would be handy if the text file were the latest product list showing what was in stock and how much things cost. Building a simple shopping cart system based around this idea would be one way to make a dynamic website. Users could have orders calculated on the fly and if you really wanted to get fancy, you could have the perl script subtract items from inventory when they were ordered! But I'll be covering this at a later date.

You have probably heard about web browser cookies in the media more than once. When building a dynamic site, cookies are very useful. Essentially a cookie is an arbitrary serial number set by the server. (forget everything you heard about invasion of privacy!) As far as a website is concerned, there are just a bunch of web browsers requesting a bunch of pages and images. It doesn't remember what pages it has sent to a particular web browser, and in some cases it can't. (such as if you have an entire office accessing the net through a proxy. All connections would come from the proxy's IP address.) So how is a website to follow a user from page to page? If a web browser were to request a page from a website that uses cookies, the webserver would pick a random number and request that the web browser send that random number back to the server every time it asks for a page. It is up to the web browser to do this or not, but if it does, the website will be able to determine, from request to request, which browser it should be building personalized pages for.
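In CGI terms, the exchange above comes down to two things: a Set-Cookie header going out, and the HTTP_COOKIE environment variable coming back in. Here is a minimal sketch of a shell CGI that does both. The cookie name "visitor" and the function name are made up, and it assumes /dev/urandom and od are available for picking the number.

```shell
#!/bin/sh
# Sketch of a CGI that hands out a visitor number as a cookie.
# HTTP_COOKIE is the CGI variable holding whatever cookies the browser
# sent back; the cookie name "visitor" is invented for this example.
cookie_page() {
    case "${HTTP_COOKIE:-}" in
        *visitor=*)
            # The browser already has a number: pull it back out.
            id=$(printf '%s\n' "$HTTP_COOKIE" | sed 's/.*visitor=\([0-9]*\).*/\1/')
            printf 'Content-type: text/html\n\n'
            printf 'Welcome back, visitor %s\n' "$id"
            ;;
        *)
            # First visit: pick a random number and ask the browser to keep it.
            id=$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')
            printf 'Set-Cookie: visitor=%s\n' "$id"
            printf 'Content-type: text/html\n\n'
            printf 'Hello, new visitor %s\n' "$id"
            ;;
    esac
}

cookie_page
```

The Set-Cookie line has to go out with the other headers, before the blank line. A real site would use the number as a key to look up that visitor's shopping cart or preferences on the server side.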

...

how do web connections work?
When a web browser wants to get a particular web page, let's say http://www.maximumlinux.com/, it first looks up the IP address of the webserver. In this case, the IP address is 209.246.21.********. Next, it opens up a connection to port 80 (the default web port) on that IP address. The server doesn't send any welcome message when the port opens. It just waits, so it's the browser's turn to ask for a specific document:

GET / HTTP/1.0

Essentially the web browser isn't specifying a particular html page here, it's just asking for "/". But the webserver knows to serve the file "index.html" as the default document, (set in the httpd.conf) so it spits this file out and closes the connection.

HTTP/1.1 200 OK
Server: Apache/1.3.9 (Unix)
Content-type: text/html

<html>
<head>
<title>MaximumLinux.com</title>

...

</body>
</html>

The web browser renders the page opening new connections for the images it needs.

You can see just how this whole process works by running telnet on port 80. Watch...

Stp~> telnet www.maximumlinux.com 80
Connected to www.maximumlinux.com port 80

GET / HTTP/1.0 <- you type this followed by 2 returns and you should get all the html kicked back at you!

HTTP/1.1 200 OK
Server: Apache/1.3.9 (Unix)
Content-type: text/html

<html>
...
</html>
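Notice the blank line in the response: everything before it is headers, and everything after it is the page itself. If you save a session like the one above to a file, a sketch like this will pull out just the headers. The function name is made up.

```shell
#!/bin/sh
# Print only the header section of an HTTP response read on standard
# input: everything before the first blank line. Trailing carriage
# returns are stripped so the output is plain Unix text.
response_headers() {
    awk '{ sub(/\r$/, "") } /^$/ { exit } { print }'
}
```

This is a handy way to check what content type a CGI is really sending without wading through the html.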

notes:
an apache install requires quite a bit of basic unix background, so I intend to spend a bit of time filling in all these background details as I go. I figure most people will be coming at linux with a GUI mindset.

in the intro, I'll discuss the ramifications of running a web server from a dsl or cable modem connection, and what you can and can't do when naming your server. (dns)

apache usually comes set up on any of the major distributions. In the "good old days" you had to compile everything yourself, so I intend to take the user through the process of downloading and installing the latest version. I'll also go over security in fairly good detail. Apache is a huge package not so nicely coded, but being the most popular web server in the world, it has had quite a few people beating on it, so there is quite a bit to talk about in the security department.

Dynamic pages will be a fairly short section because there are whole other articles in there. (most notably my perl article)