A workaround for a powerpc saltstack bug

From PDP/Grid Wiki
Revision as of 15:58, 15 August 2019 by Dennisvd@nikhef.nl

With the arrival of the first Power9 systems as data movers in our storage setup, we faced the many quirks and glitches that a whole new architecture brings to our configuration management. While the commit log of the salt/reclass repository testifies to that effort, one issue deserves separate mention: it lives upstream and has been open and unresolved for a year, in spite of being nearly trivial.

There are several things going on here, and the journey to find both the cause and a decent workaround is worth sharing in a public space such as this wiki. The issue is transient: it should eventually be resolved upstream, and a software update will bring this to a close. For the moment, though, we have to remain aware that there is a sizable workaround in place.

First of all, on a Power9 system installed with CentOS 7, observe the output of uname -a:

Linux hooikanon-01.nikhef.nl 4.14.0-115.el7a.0.1.ppc64le #1 SMP Sun Nov 25 20:39:39 GMT 2018 ppc64le ppc64le ppc64le GNU/Linux

Whether it is the machine hardware name, the processor type or the hardware platform, the common theme is 'ppc64le', meaning PowerPC, 64-bit, little-endian.

Now ask the salt system for its grains and it will report this nugget:

[root@hooikanon-01 ~]# salt-call grains.get osarch

Notice that the reported string is objectively different from what we saw above: the grain says powerpc64le rather than ppc64le. A deep dive into the Python code reveals that this value is derived from the output of:

rpm --eval %{_host_cpu}

This is already a very questionable way to derive OS information. It looks as if salt is literally asking a package management system what it knows about the CPU, but even that is not true: this particular macro in the RPM packaging system is set at build time (i.e. when the RPM software itself was built) to whatever Automake reported. Automake is of course part of the venerable but dated GNU build system, which still covers the majority of open source software but is by no means the only one.

So to recap: whatever Automake's idea of the host CPU was on the build machine where the rpm software was compiled now decides what the osarch grain will be on the runtime machine where the rpm command is run.
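To see the two sources of truth side by side, a small sketch along the following lines can help. It is not salt's code: the function names are made up for illustration, and it simply compares what the running kernel reports with what the rpm macro expands to. It returns None for the rpm value when no rpm binary is available.

```python
import platform
import subprocess


def kernel_arch():
    """Architecture as reported by the running kernel (what `uname -m` shows)."""
    return platform.machine()


def rpm_host_cpu():
    """Value of the rpm macro %{_host_cpu}, or None if rpm cannot be run."""
    try:
        out = subprocess.check_output(["rpm", "--eval", "%{_host_cpu}"])
    except (OSError, subprocess.CalledProcessError):
        return None
    # The command output ends with a newline; remove it before comparing.
    return out.decode().strip()


if __name__ == "__main__":
    print("kernel:", kernel_arch())
    print("rpm   :", rpm_host_cpu())
```

On the Power9 machines described here the two lines would be expected to differ, which is the mismatch this page is about.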

(At this moment I am not even sure what this grain would be when the machine is installed with Debian.)

The mismatch between the two strings leads to at least one problem when installing packages. Investigation of the code showed that salt needs to treat the architecture of installed packages in a special way on x86_64 systems, because these allow the installation of 32-bit software alongside the 64-bit versions, so a package like libc is different from libc.i386. When dealing with lists of installed packages, salt therefore internally tacks on the architecture label, and strips it off again unless it needs special care (like i386). Guess what it will try to match the rpm architecture with?

It is the osarch grain. And because that does not match the architecture field in the rpm packages, none of the architecture-dependent software packages get their architectures stripped. The PyYAML package is reported as PyYAML.ppc64le, so when you ask whether PyYAML is installed, salt will tell you no.

If you think that may be innocent, it really isn't. When the pkg.installed state is called to install packages, yum is first asked to install them (and yum will say, ok, sure, done). The result is then verified by asking whether the packages are installed, and because of the above bug the answer is no for everything except architecture-independent packages.

That is the basis for the pull request that maps powerpc64le onto ppc64le when resolving the package name.
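The name resolution and the effect of the mapping can be sketched as follows. This is a simplified illustration, not the upstream code: the names display_name and ARCH_ALIASES are hypothetical, and the special handling of 32-bit packages is reduced to "keep the suffix whenever the architecture is not the native one".

```python
# Hypothetical names throughout; this mirrors the behaviour described in the
# text, not the exact upstream implementation.
ARCH_ALIASES = {"powerpc64le": "ppc64le"}  # osarch grain value -> rpm package arch


def display_name(name, pkg_arch, osarch, aliases=None):
    """Return the name under which a package appears in the package list.

    The architecture suffix is dropped when the package architecture equals
    the (optionally alias-mapped) osarch or is 'noarch'; otherwise it is
    kept, as for i386 packages on x86_64.
    """
    native = (aliases or {}).get(osarch, osarch)
    if pkg_arch in (native, "noarch"):
        return name
    return "{0}.{1}".format(name, pkg_arch)


# Without the mapping the native package keeps its suffix, so looking for
# plain "PyYAML" in the package list fails.
unmapped = display_name("PyYAML", "ppc64le", "powerpc64le")

# With the mapping the suffix is stripped and the lookup succeeds.
mapped = display_name("PyYAML", "ppc64le", "powerpc64le", ARCH_ALIASES)

# A 32-bit package on x86_64 keeps its suffix either way.
multilib = display_name("libc", "i386", "x86_64", ARCH_ALIASES)
```

Here unmapped is "PyYAML.ppc64le" while mapped is "PyYAML", which is exactly the difference between pkg.installed reporting failure and success.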

Here we find ourselves in a tricky position. Because the pull request is not currently part of any release of saltstack, we cannot upgrade to solve the issue (yet). And because the fix is in the file /usr/lib/python2.7/site-packages/salt/utils/pkg/rpm.py we cannot override its functionality through the modules extension system of saltstack.

But what we can do is override the yumpkg.py module that ultimately uses the utils/pkg/rpm.py file, by taking the offending code from the latter and sticking it into our own version of the former. With a little rerouting of calls in Python we can make sure that our module uses the corrected version of resolve_name(). For the moment this fix lives in /srv/salt/env/.../files/_modules/yumpkg.py. It should be removed once the fix comes out in a later saltstack release.
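The rerouting idea, stripped of everything salt-specific, looks roughly like the sketch below. It is an illustration under assumptions, not the contents of our yumpkg.py: the shipped utility module is replaced by a stand-in object so the example runs without salt, list_pkgs is a hypothetical caller, and the three-argument signature of the resolve functions is chosen for the example only.

```python
import types

# Stand-in for the shipped utility module (utils/pkg/rpm.py in the text),
# built here so that the sketch runs without salt installed.
shipped_rpm = types.ModuleType("shipped_rpm")


def _shipped_resolve_name(name, arch, osarch):
    # Behaviour as described in the text: no mapping between arch spellings.
    return name if arch in (osarch, "noarch") else name + "." + arch


shipped_rpm.resolve_name = _shipped_resolve_name


def _fixed_resolve_name(name, arch, osarch):
    # Local copy of the function, carrying the powerpc64le -> ppc64le mapping.
    native = {"powerpc64le": "ppc64le"}.get(osarch, osarch)
    return name if arch in (native, "noarch") else name + "." + arch


def list_pkgs(installed, osarch, resolve=shipped_rpm.resolve_name):
    """Hypothetical caller in the overriding module; `resolve` is the rerouting point."""
    return sorted(resolve(name, arch, osarch) for name, arch in installed)


installed = [("PyYAML", "ppc64le"), ("tzdata", "noarch")]
before = list_pkgs(installed, "powerpc64le")
after = list_pkgs(installed, "powerpc64le", resolve=_fixed_resolve_name)
```

The shipped file is left untouched; only the overriding module decides which function it calls, which keeps the workaround easy to remove later.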

By the way, there is one more thing wrong with the get_osarch() code: it takes the output of subprocess.Popen() but does not account for the newline at the end of that output. As a result a spurious newline is present at the end of the osarch value, and that messes with the mapping in the fix. Our version of the fix therefore has an extra strip() to get rid of the newline.
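Why the trailing newline matters for the mapping is easy to demonstrate. The sketch below is an assumption-laden stand-in: instead of calling rpm it starts a child Python process that prints the same string, and ARCH_ALIASES is a hypothetical name for the mapping.

```python
import subprocess
import sys

ARCH_ALIASES = {"powerpc64le": "ppc64le"}  # hypothetical name for the mapping

# Stand-in for `rpm --eval %{_host_cpu}` so that the sketch runs without rpm:
# a child process that prints the value followed by a newline.
cmd = [sys.executable, "-c", "print('powerpc64le')"]
raw = subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate()[0].decode()

# The raw value still carries the newline, so it is not a key in the mapping
# and falls through unchanged.
unstripped = ARCH_ALIASES.get(raw, raw)

# After strip() the lookup finds the key and yields the rpm spelling.
stripped = ARCH_ALIASES.get(raw.strip(), raw.strip())
```

Without the strip() the mapping silently does nothing, which is why the fix appeared not to work until the newline was removed.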