
Future of OpenSolaris Boot Environment management March 12, 2008

Posted by mgerdts in Uncategorized.

I was quite happy to see this recent post from Ethan Quach proposing an efficient method for sharing the variable parts of /var. It bears a striking resemblance to something that I suggested and clarified in the past.

But why does this matter? When you are making significant changes to the system, such as during a periodic patch cycle or upgrade, it is generally desirable to…

  1. be able to do so without taking the system down for the duration of the process
  2. be able to abort the operation if you have a change of heart
  3. be able to fail back if you realize that newer isn’t better

Consider what is in /var:

  • Mail boxes: If the machine is a mail server (using sendmail et al.) there is a pretty good chance that users have their active mail boxes at /var/mail.
  • In flight mail messages: Most machines process some email. For example, if a cron job generates output it is sent to the user via email. Many non-web mail clients invoke /bin/mail or /usr/lib/sendmail to cause mail to be sent. Each message spends somewhere between a few milliseconds and a few days in /var/spool/mqueue or /var/spool/clientmqueue.
  • Print jobs: If the machine acts as a print server (even for a directly attached printer) each print job spends a bit of time in /var/spool/lp.
  • Logs: When something goes wrong, it is often useful to look in log messages to figure out why it went wrong. Those are often found under /var/adm.
  • Temporary files that may not be: It is rather common for people to stick stuff in /var/tmp and expect to be able to find it sometime in the future.
  • DHCP: If a machine is a DHCP server, it will store configuration and/or state information in /var/dhcp.

All of those things should be of a stable file format and usable before and after you patch/upgrade/whatever. If you take the traditional Live Upgrade approach, you can patch or upgrade to an alternate boot environment. As part of activating the new environment, a bunch of files are copied between boot environments. According to this page the following things are synchronized:

/var/mail                    OVERWRITE
/var/spool/mqueue            OVERWRITE
/var/spool/cron/crontabs     OVERWRITE
/var/dhcp                    OVERWRITE
/etc/passwd                  OVERWRITE
/etc/shadow                  OVERWRITE
/etc/opasswd                 OVERWRITE
/etc/oshadow                 OVERWRITE
/etc/group                   OVERWRITE
/etc/pwhist                  OVERWRITE
/etc/default/passwd          OVERWRITE
/etc/dfs                     OVERWRITE
/var/log/syslog              APPEND
/var/adm/messages            APPEND

Notice that the default configuration loses your in flight print jobs because /var/spool/lp is not copied. Suppose you have a mail server with a few gigs of mail at /var/mail. Is it a good use of time or disk space to copy /var/mail between boot environments?

A much better solution seems to be to make those directories shared between the boot environments. The way to do this in Live Upgrade, and presumably in the future, is to remove them from (or not add them to) /etc/lu/synclist and allocate separate file systems. However, do you really want a file system for /, /var/mail, /var/spool/mqueue, /var/spool/clientmqueue, /var/spool/lp, /var/adm, /var/tmp, /var/dhcp, …? What if you had someone tell you that you had to monitor every file system on every machine for being out of space? How big would you make all of those file systems so that your monitoring didn’t wake you up in the middle of the night?

In the future, it looks as though OpenSolaris will use ZFS to store each boot environment. Among the features of ZFS that make this desirable are snapshots, clones, and rethinking the boundary between disk slices (or volumes) and file systems. If the organization of /var is changed just a bit…

/var/adm -> share/adm
/var/dhcp -> share/dhcp
/var/mail -> share/mail
/var/spool -> share/spool
/var/tmp -> share/tmp
/var/share/adm
/var/share/dhcp
/var/share/mail
/var/share/spool
/var/share/tmp

Then you can get by with having two zfs file systems: / and /var/share. The Snap Upgrade process would then likely do the following:

  1. Take a snapshot of /, clone it, then mount it somewhere usable in subsequent steps (e.g. /mnt/abe)
  2. Do whatever is needed on the alternate boot environment mounted at /mnt/abe.
  3. Unmount the alternate boot environment
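
As a rough sketch of those three steps, the zfs commands might look like the following. The dataset names rpool/be/current and rpool/be/new are hypothetical and do not come from any Snap Upgrade design; only /mnt/abe is taken from the list above. The functions print each command and run it only when RUN=1 is set, so the sketch can be tried on a machine without ZFS.

```shell
#!/bin/sh
# Sketch only: dataset names are illustrative assumptions, not part of any
# published Snap Upgrade design.

# Print each command; execute it only when RUN=1.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# snap_upgrade_prepare CURRENT_BE NEW_BE MOUNTPOINT
# Step 1: snapshot the current root, clone it, mount the clone.
snap_upgrade_prepare() {
    cur=$1; new=$2; mnt=$3
    snap="$cur@$(basename "$new")"
    run zfs snapshot "$snap"
    run zfs clone "$snap" "$new"
    run zfs set mountpoint="$mnt" "$new"
}

# snap_upgrade_finish NEW_BE
# Step 3: unmount the alternate boot environment.
snap_upgrade_finish() {
    run zfs unmount "$1"
}

snap_upgrade_prepare rpool/be/current rpool/be/new /mnt/abe
# Step 2 happens here: patch or upgrade the files under /mnt/abe.
snap_upgrade_finish rpool/be/new
```

Because a clone initially shares all of its blocks with the snapshot, step 1 costs almost no time or disk space regardless of how large / is.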

When it comes time to activate the new boot environment, there are some files that likely need to be synchronized using the traditional mechanism. For instance, if someone tried to get into a system by guessing a user’s password, there is a reasonable chance that the account was locked via a modification to /etc/shadow. Presumably you don’t want to give the bad guy another chance when you activate the new boot environment. Note, however, that the files that may need to be synchronized in /etc are nearly always small files and there would not be very many of them. The files in /var/share would not need to be synchronized. However, just in case the new version of sendmail decides to eat mailboxes, it would be very nice to be able to recover.

This means that activating a boot environment would look like:

  1. Bring the system into single-user mode
  2. Mount the alternate boot environment
  3. Synchronize those files that need to be synchronized
  4. Take a snapshot of /var/share
  5. Set the boot loader to boot from the new boot environment and offer a failback option to the old boot environment
  6. Reboot

The items in italics are special to boot environment activation. Each one should take a couple seconds or less – adding far less than thirty seconds to the normal reboot process to activate the new boot environment. Failback would be similarly quick.
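
The synchronization step is the only one that touches file contents. Here is a minimal sketch of the OVERWRITE and APPEND actions from the synclist shown earlier, operating on two directory trees that stand in for the old and new boot environments. The function names are made up for illustration, and real Live Upgrade does more than this.

```shell
#!/bin/sh
# Sketch only: mimics the OVERWRITE and APPEND synclist actions on ordinary
# directory trees. Function names are hypothetical.

# sync_entry PATH ACTION OLD_ROOT NEW_ROOT
sync_entry() {
    path=$1; action=$2; old=$3; new=$4
    # Refuse an empty NEW_ROOT so the running system is never the target.
    [ -n "$new" ] || { echo "NEW_ROOT must not be empty" >&2; return 1; }
    [ -e "$old$path" ] || return 0      # nothing to carry over
    mkdir -p "$(dirname "$new$path")"
    case $action in
        OVERWRITE)
            # The old environment's copy replaces the new one.
            rm -rf "$new$path"
            cp -Rp "$old$path" "$new$path"
            ;;
        APPEND)
            # The old environment's contents are added to the end.
            cat "$old$path" >> "$new$path"
            ;;
        *)
            echo "unknown action: $action" >&2
            return 1
            ;;
    esac
}

# sync_list SYNCLIST OLD_ROOT NEW_ROOT
# Applies every "path action" line; blank lines and # comments are skipped.
sync_list() {
    while read -r path action; do
        case $path in
            ''|'#'*) continue ;;
        esac
        sync_entry "$path" "$action" "$2" "$3" || return 1
    done < "$1"
}
```

With the alternate boot environment mounted at /mnt/abe, a call might look like `sync_list /etc/lu/synclist / /mnt/abe`. With the large directories moved under a shared file system, the list shrinks to a handful of small files in /etc.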

Now suppose this system is a bit more complicated and has 20 zones on it. Have you ever patched a system with 20 zones on it? Did you start on Friday and finish on Monday? How happy were the users with the “must install in single-user mode” requirement? This same technique should allow you to have two file systems per non-global zone: one for the zone root and one for /var/share in the zone. Supposing that the reboot processing takes 5 seconds per zone (100 seconds for 20 zones), you are looking at an extra minute or two to reboot rather than a weekend of down time.

Without Live Upgrade or Snap Upgrade, what would backout look like? After you had the system down for patching for a couple days, you could take it down again for a couple days to back the patches out. Or you could go to tape. Neither is an attractive option. With Snap Upgrade you should be able to fail back with your normal reboot time plus a minute or two.

reproducible hang with ldom preview April 12, 2007

Posted by mgerdts in Uncategorized.

This blog entry exists because the formatting is horribly broken at http://forum.java.sun.com/thread.jspa?threadID=5159568 where I originally posted it.

I configured a T2000 as described in the beginner’s guide (http://www.sun.com/blueprints/0207/820-0832.pdf) with the exception of the device allocated for the root disk. For that I came up with my own variant of http://unixconsole.blogspot.com/2007/04/time-to-build-guest-domain.html.

My variant of using a file involved creating the file with mkfile on a zfs file system. That is…

zpool create zfs mirror c1t0d0s4 c1t1d0s4
zfs create zfs/ldoms
zfs set compress=on zfs/ldoms
mkfile 32G /zfs/ldoms/root.img

As I install Solaris in the ldom, the server (control domain) dies after extracting a few hundred megabytes of a flash archive. I have traced this down to it running out of memory.

Here’s “vmstat 4” output on the control domain console:

...
0 0 0 8636536 24776  1  70  0  0  0  0  0  0  0  0  0 1954  311 2244  0 23 76
0 0 0 8645096 22280  1  36  0  0  0  0  0  0  0  0  0 5957  370 9827  0 19 80
0 0 0 8649008 17720  1  40  0 296 313 0 44 0  0  0  0 9975  361 15877 0 24 75
0 0 0 8651104 15944  1  52  0 807 1671 0 700 0 0 0  0 10725 347 17545 0 26 74
0 0 0 8650800 18376  0  60  0 88 239 0 127 0  0  0  0 9816  391 15545 1 33 67
0 0 0 8640432 15936  0  76  0 497 3025 0 3874 0 0 0 0 11367 432 17975 0 35 65
0 0 0 8642968 17032  1  59  0 452 2028 0 842 0 0 0  0 10266 363 16127 0 27 73
kthr      memory            page            disk          faults      cpu
r b w   swap  free  re  mf pi po fr de sr m0 m1 m2 m1   in   sy   cs us sy id
0 0 0 8644768 15744  0  56  0 387 1298 0 126 0 0 0  0 10170 330 16355 0 24 75
0 0 0 8652504 18368  1 113  0 372 2462 0 273 0 0 0  0 11171 321 18613 0 35 65
0 0 0 8652832 15720  1 134  0 411 6081 0 738 0 0 0  0 11541 332 18979 0 34 66
0 0 0 8652232 14312  1  94  0 413 1806 0 7775 0 0 0 0 10718 358 18271 0 38 62
0 0 0 8647360 12592 18 133  9 555 5176 0 17490 1 0 1 0 10394 320 16970 1 37 63
0 0 0 8645248 14408  2  73 22 486 5039 0 3111 2 1 1 0 11749 383 18336 0 40 59
2 0 43 8641800 2784  1 148 99 1070 1517 0 53982 19 9 9 5 8316 356 14226 0 43 57
0 0 116 8647032 800  1  42 127 134 312 3688 76207 14 7 7 1 2153 114 3726 0 29 71

At this point the server froze. Note that the w column shows 116 swapped-out threads and the “de” column (the anticipated short-term memory shortfall, in KB) is 3688. Very bad news.

My initial thoughts were that I was running into some of the low-memory problems known to happen with the ZFS arc. This does not seem to be the case. According to mdb, the arc size is around 60 MB:

# mdb unix.3 vmcore.3
...
> arc::print -td size
uint64_t size = 0t61455360

The control domain is S10 11/06 + 118833-36 + the patches required for ldoms + many others. The ldom that is in the process of being installed is booted from a S10 11/06 netinstall image (118833-33).

Random password one-liner September 1, 2006

Posted by mgerdts in Uncategorized.

I recently came up with this method for generating reasonable random 8-character passwords:

$ dd if=/dev/random bs=6 count=1 2>/dev/null | openssl base64
LCia46S4

If 8 characters is not long enough, increase the number after bs= to 75% of the number of characters you would like in the password.
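
The 75% figure comes from base64 turning every 3 input bytes into 4 output characters. A small sketch of that arithmetic follows, wrapped around the same dd and openssl pipeline. The function names pw_bytes and random_password are made up for illustration.

```shell
#!/bin/sh
# Base64 encodes 3 bytes as 4 characters, so an N-character password needs
# ceil(N * 3 / 4) random bytes.
pw_bytes() {
    echo $(( ($1 * 3 + 3) / 4 ))
}

# random_password LENGTH
# Prints LENGTH base64 characters. Padding ('=') and line breaks added by
# openssl are stripped, then the result is trimmed to the requested length.
# Note: dd with count=1 does a single read, which on some systems may return
# fewer bytes than requested and so yield a shorter password.
random_password() {
    len=$1
    dd if=/dev/random bs="$(pw_bytes "$len")" count=1 2>/dev/null |
        openssl base64 | tr -d '\n=' | cut -c "1-$len"
}

pw_bytes 8     # prints 6, the bs= value used above
pw_bytes 12    # prints 9
```

For lengths that are not a multiple of 4, such as 10, the byte count is rounded up (8 bytes) and the extra characters are cut off.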

Install Solaris from DVD image on disk August 27, 2006

Posted by mgerdts in Uncategorized.

My personal SPARC machine is pathetic by today’s standards: an Ultra II with a pair of 300 MHz processors, 768 MB RAM, and a very slow CDROM drive. This is pretty much the slowest machine that is supported by Solaris 10. Even so, today I decided it was time to get a fresh installation of Solaris Express (build 46) on it.

I first tried the live upgrade route. However, that didn’t work out too well because I had previously used bfu to get some newer OpenSolaris bits on the machine. I really did not want to repeat the download process for all the CD ISO’s (already had downloaded the DVD ISO). Now, if you think that downloading and burning is slow – you should see the speed of the installation on this CDROM drive. It was probably OK in the days when Solaris fit on one CD, but not today with 5(?) CD’s to complete the installation.

The disk layout of the machine was as follows:

  • c0t0d0 32 GB disk
    • c0t0d0s0 – 4.5 GB available for new /
    • c0t0d0s1 – ~500 MB swap
    • c0t0d0s7 – remainder as zfs pool “pool0”
  • c0t1d0 4 GB disk
    • c0t1d0s0 – Root with build 36 (?) + random BFU bits

I had the DVD image in a subdirectory of my home directory that was in the pool0/home file system in the zfs pool.

To make use of that DVD image without buying a SCSI DVD drive, I did the following:

  1. Burn build 46 CD0 to a CD-R
  2. Boot from the CD-R
  3. Go clean up the shop from the woodworking I was doing earlier
  4. Do some laundry
  5. Return to the Ultra II to find that it was just about to ask me which language I speak. Really, it was still working on it. Now do you know why I didn’t want to feed it 5 CD’s?
  6. Answer sysidcfg questions
  7. Exit the installer
  8. zpool import pool0. After the import was complete but before mounting file systems, zpool crashed with a segv. Later I saved that core file to /a for later analysis
  9. zfs set mountpoint=/tmp/home pool0/home
  10. zfs mount pool0/home
  11. lofiadm -a /tmp/home/build46.iso
  12. umount /cdrom
  13. mount -F hsfs -o ro /dev/lofi/1 /cdrom
  14. install-solaris
  15. Go blog about a cool hack. :)

The installation is now about 40% done. Looks like the hack is working just fine. I wonder if I could bundle this all up in a begin script (especially the laundry) to automate the installation from an ISO image after booting from local media.
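
As a starting point for such a script, steps 8 through 13 could be collected as below. This is an untested sketch: the pool, dataset, and ISO names are the ones from this post, the function name is made up, and it only prints the commands unless RUN=1 is set. Rather than assuming /dev/lofi/1, it uses the device name that lofiadm -a prints when it really runs.

```shell
#!/bin/sh
# Sketch only: untested as a begin script. Names come from the steps above.
POOL=${POOL:-pool0}
ISO=${ISO:-/tmp/home/build46.iso}

# Print each command; execute it only when RUN=1.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

mount_iso_as_cdrom() {
    run zpool import "$POOL"
    # Relocate the home file system so the ISO is reachable from the
    # install environment. If the import already mounted it, the explicit
    # mount below just reports that and can be ignored.
    run zfs set mountpoint=/tmp/home "$POOL/home"
    run zfs mount "$POOL/home"

    if [ "${RUN:-0}" = "1" ]; then
        dev=$(lofiadm -a "$ISO")    # lofiadm prints the device it allocated
    else
        echo "+ lofiadm -a $ISO"
        dev=/dev/lofi/1             # the device seen in the steps above
    fi

    # Swap the slow CD for the ISO image.
    run umount /cdrom
    run mount -F hsfs -o ro "$dev" /cdrom
}

mount_iso_as_cdrom
```

After this, install-solaris can be started as in step 14. The laundry remains a manual step.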

Update on zoneadm create with zfs March 1, 2006

Posted by mgerdts in Uncategorized.

After reaching out to Sun to work on getting my work integrated into OpenSolaris, I found that Sun was already working on this feature. Subsequently, they indicated that the code made it into some internal source code tree. As such, I am holding off on future development until I can get at that code.

However, if you want to try it out, I have posted the code for others to play with. If you have a working OpenSolaris build environment, you should be able to drop in my modified zoneadm.c, run dmake all, then use the resulting zoneadm command. Alternatively, the sparc version of the zoneadm binary is also available.

Enjoy!


Zone created in 0.922 seconds February 20, 2006

Posted by mgerdts in Uncategorized.

I noticed today that “zoneadm clone” exists in the latest OpenSolaris code. Unfortunately, cloning a zone only offered the copy mechanism that was essentially “find | cpio”. A bit of hacking later and we have this:

# time ksh -x /var/tmp/clone
+ newzone=fast
+ template=template
+ zoneadm=/ws/usr/src/cmd/zoneadm/zoneadm
+ PATH=/usr/bin:/usr/sbin
+ zonecfg -z fast create -t template
+ zonecfg -z fast set zonepath=/zones/fast
+ /ws/usr/src/cmd/zoneadm/zoneadm -z fast clone -m zfsclone template
Cloning zonepath /zones/template...

real    0m0.922s
user    0m0.128s
sys     0m0.171s

This is achieved by using zfs to create a snapshot of the template zone, then cloning the snapshot to create the zonepath of the new zone. A bit of cleanup is needed, but goodness is on the way.
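
In zfs terms, the idea amounts to two commands. The sketch below assumes a pool named zones mounted at /zones, so that /zones/template and /zones/fast are the datasets zones/template and zones/fast. Those dataset names are an assumption, and this is not the code from the modified zoneadm.c. Commands are printed and run only when RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: dataset names are assumed; this is not the zoneadm patch.

# Print each command; execute it only when RUN=1.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# zfs_clone_zonepath TEMPLATE_DATASET NEW_DATASET
zfs_clone_zonepath() {
    snap="$1@$(basename "$2")"
    run zfs snapshot "$snap"    # freeze the template's current contents
    run zfs clone "$snap" "$2"  # writable copy that shares blocks with it
}

zfs_clone_zonepath zones/template zones/fast
```

No file data is copied, which is why the whole operation fits in under a second regardless of the size of the template zone.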


hunting bugs in filebench February 16, 2006

Posted by mgerdts in Uncategorized.

I’ve been using filebench a bit at work and decided that I would like to try a few things out at home. My home machine is not quite as beefy as the V40z’s that I have been testing on at work.

Getting filebench to compile in the first place is a bit of work. Probably works really well on someone else’s system, but mine is obviously different. That’s another story though. After compiling filebench, I ran it for the first time and saw this:

$ /opt/filebench/bin/filebench
Segmentation fault (core dumped)

Bummer. Well, let’s see where that is at:

$ gdb /opt/filebench/bin/filebench core
GNU gdb 6.4-debian

. . .

(gdb) where
#0  0x37dd84aa in memset () from /lib/tls/i686/cmov/libc.so.6
#1  0x0807b01e in ?? ()
#2  0x080522da in ipc_init () at ipc.c:264
#3  0x08058bc1 in main (argc=1, argv=0x3f8fdcf4) at parser_gram.y:1140

OK, so let’s go with the assumption that the bug is in the code listed as alpha on the web site, and not libc. So we go up the stack a couple levels.

(gdb) up 2
#2  0x080522da in ipc_init () at ipc.c:264
264             memset(filebench_shm, 0, c2 - c1);
(gdb) print filebench_shm
$1 = (filebench_shm_t *) 0xffffffff

Hmmm… 0x with a bunch of f’s looks like -1. Perhaps some system call on Solaris (presumably where filebench started) returns NULL on error and on Linux it returns -1. Let’s go looking for that system call.

(gdb) list
259     #endif /* USE_PROCESS_MODEL */
260
261             c1 = (caddr_t)filebench_shm;
262             c2 = (caddr_t)&filebench_shm->marker;
263
264             memset(filebench_shm, 0, c2 - c1);
265             filebench_shm->epoch = gethrtime();
266             filebench_shm->debug_level = 2;
267             filebench_shm->string_ptr = &filebench_shm->strings[0];
268             filebench_shm->shm_ptr = (char *)filebench_shm->shm_addr;

Nope, not there. Maybe a bit further up.

(gdb) list 250
245     #endif
246
247             if ((filebench_shm = (filebench_shm_t *)mmap(0, sizeof(filebench_shm_t),
248                     PROT_READ | PROT_WRITE,
249                     MAP_SHARED, shmfd, 0)) == NULL) {
250                     filebench_log(LOG_FATAL, "Cannot mmap shm");
251                     exit(1);
252             }
253
254     #else

It looks like mmap may be the culprit. I first asked man, but this is Linux, not Solaris. No man page for mmap! Next try google. Google comes up with this page that looks a lot like a man page. Why isn’t that found on my system? Another thing for another day. Anyway, it says:

RETURN VALUE

On success, mmap returns a pointer to the mapped area. On error, the value MAP_FAILED (that is, (void *) -1) is returned, and errno is set appropriately. On success, munmap returns 0, on failure -1, and errno is set (probably to EINVAL).

Ok, so it is returning -1 because it doesn’t like something. Let’s see what it is trying to mmap:

(gdb) print sizeof(filebench_shm_t)
$2 = 907368000
(gdb) print sizeof(filebench_shm_t) / 1024 / 1024
$3 = 865
(gdb)

That ’splains it. It looks like it is trying to set up a shared memory segment that is 865 MB. My poor little system only has 512.

FWIW, I have created a patch that addresses this one problem but I haven’t had a chance to test it on Solaris yet. Unfortunately, with the patch, it just tells me that the mmap failed. It doesn’t address the fact that it is trying to allocate a shared memory segment larger than the size of RAM on my system.

Update 1:

I have posted several patches to the bug tracking system at sourceforge.net. This particular one is 1432638. It turns out that mmap on Solaris also returns MAP_FAILED so the patch is simpler than I originally expected.


Download and gunzip in one step February 1, 2006

Posted by mgerdts in Uncategorized.

I was feeling the need to take a look at Nexenta and decided that I wasn’t terribly interested in waiting for a download, then waiting for a gunzip. Why not do them both at the same time?

$ wget -O /dev/stdout \
	http://www.gnusolaris.org/gsmirror/genunix.org/elatte_installcd_alpha2_i386.iso.gz \
	| gunzip > elatte_installcd_alpha2_i386.iso
--20:21:47--  http://www.gnusolaris.org/gsmirror/genunix.org/elatte_installcd_alpha2_i386.iso.gz
           => '/dev/stdout'
Resolving www.gnusolaris.org... 216.129.112.21
Connecting to www.gnusolaris.org|216.129.112.21|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.genunix.org/distributions/gnusolaris/elatte_installcd_alpha2_i386.iso.gz [following]
--20:21:48--  http://www.genunix.org/distributions/gnusolaris/elatte_installcd_alpha2_i386.iso.gz
Resolving www.genunix.org... 204.152.191.100
Connecting to www.genunix.org|204.152.191.100|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 567,433,011 (541M) [text/plain]

13% [====>                                ] 77,025,880   359.25K/s    ETA 22:27

Just 22 minutes to go. I guess at this rate I could have piped it through cdrecord with “speed=2”.
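
One thing to watch with the pipeline above is that an interrupted download leaves a partial .iso behind that looks like a finished one. A small sketch of a wrapper that avoids this follows. The function name fetch_gunzip and the FETCH variable are made up for illustration, and it relies on gunzip exiting non-zero when its input is cut short, which is the usual behavior.

```shell
#!/bin/sh
# Sketch only: fetch_gunzip and FETCH are hypothetical names.
# FETCH must write the downloaded data to stdout; "wget -q -O -" does that.
FETCH=${FETCH:-wget -q -O -}

# fetch_gunzip URL OUTPUT
# Streams the download through gunzip into OUTPUT.part and renames it to
# OUTPUT only if gunzip reports success.
fetch_gunzip() {
    url=$1; out=$2
    if $FETCH "$url" | gunzip > "$out.part"; then
        mv "$out.part" "$out"
    else
        rm -f "$out.part"
        return 1
    fi
}
```

Usage would be `fetch_gunzip URL elatte_installcd_alpha2_i386.iso`, with URL being the .iso.gz location from the command above.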

patch_order made easy January 14, 2006

Posted by mgerdts in Uncategorized.

Some of my most tedious times as a Solaris administrator have been when I needed to create a patch_order file for a custom patch cluster. For a long time I have intended to just write a script…

But now, I don’t have to do that any longer! Today I discovered that smpatch(1M) now has an order subcommand. This makes it really quite simple for me to create a patch_order file for a very long list of patches. In this example, I create the patch_order file for the patches in the Solaris 10 Update 1 UpgradePatches directory:

# cd /mnt/Solaris_10/UpgradePatches
# ls > /tmp/patches
# smpatch order -d `pwd` -x idlist=/tmp/patches > /tmp/patch_order

Now, if you want to go the full length and create a patch cluster for it:

# mkdir /tmp/10U1_UpgradePatches
# cd /tmp/10U1_UpgradePatches
# mv /tmp/patch_order .
# ln -s /mnt/Solaris_10/UpgradePatches/* .
# cp /somewhere/10_Recommended/install_cluster .

Modify the SUPPLEMENT_NAME=”…” line in install_cluster to be more descriptive for this patch cluster. Be sure to not use characters like /, \, |, etc.

# cd /tmp
# zip -rq 10U1_UpgradePatches.zip 10U1_UpgradePatches

At this point, you can copy 10U1_UpgradePatches.zip around to your various machines and use it just like you would a 10_Recommended bundle.

Enjoy!

May 12, 2005

Posted by mgerdts in Uncategorized.

I am sad to report that the world is running short of geeks. I think it was the 1200 baud modem in my 386sx system that I smoked while building Linux 0.99pl14 that solidified my position in the 90’s.


My computer geek score is greater than 96% of all people in the world! How do you compare?