Auth via Active Directory stops working on some servers

We’ve got a couple dozen SLES servers that we configured to authenticate with our Active Directory domain via the “Windows Domain Membership” tool in YaST. We primarily use it for SSH login, and we also have a line in our sudoers file allowing authorization via an AD group.
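For context, the sudoers entry is roughly of this shape; the exact escaping depends on the winbind separator, so treat it as a sketch rather than the literal line:

[CODE]# sketch of the sudoers rule; the group name matches our require_membership_of group
%DOMAIN\\linuxserverusers ALL=(ALL) ALL[/CODE]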

Lately, some servers have suddenly lost the ability to authenticate via AD. So far I’ve been unable to find a common link explaining why these particular servers were affected.

[LIST]
[*]The first affected server is SLES12 SP1, and it broke some weeks ago. I was busy at the time and can no longer remember what was done on the server around the time it happened.
[*]The second server broke during the first step of a SLES12 SP1 -> SP2 -> SP3 upgrade (so somewhere between SP1 and SP2). It is now on SP3 and still broken.
[*]The last server (so far) broke while I was preparing to upgrade SP1 -> SP2 -> SP3. The root partition was nearly full, so I extended the virtual disk in VMware, booted the VM from a SLES12 SP2 ISO into rescue mode, and grew the partition and filesystem with fdisk and resize2fs (roughly the steps sketched after this list). After doing just that, without making any changes at the OS level, AD auth broke. I stopped there, and the server is still on SLES12 SP1.
[/LIST]
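For completeness, the rescue-mode resize amounted to roughly the following; the device names below are placeholders, not that VM’s actual layout:

[CODE]# from the SP2 ISO rescue shell (device names are examples)
fdisk /dev/sda        # delete and recreate the root partition with the same start sector and a larger end
partprobe /dev/sda    # have the kernel re-read the partition table
resize2fs /dev/sda2   # grow the ext filesystem to fill the enlarged partition[/CODE]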

The rest of my servers (mostly SLES11, but some SLES12 SP1 and SP3) are working just fine, and everything else on the affected servers also appears to be working fine. In fact, applications on the affected servers can still authenticate against AD successfully through their own mechanisms.

I’ve turned on debugging in pam_winbind.conf and compared the logs. Both successful and unsuccessful attempts start the same way:

[CODE]2018-05-23T08:37:19.621649-04:00 server01 sshd[20783]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=172.16.xx.xx user=domain\\username
2018-05-23T08:37:19.622328-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): [pamh: 0x55c99b087a10] ENTER: pam_sm_authenticate (flags: 0x0001)
2018-05-23T08:37:19.622694-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): getting password (0x000000d1)
2018-05-23T08:37:19.623049-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): pam_get_item returned a password
2018-05-23T08:37:19.623381-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): Verify user 'domain\\username'
2018-05-23T08:37:19.648066-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): CONFIG file: require_membership_of 'LinuxServerUsers'
2018-05-23T08:37:19.648505-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): enabling krb5 login flag
2018-05-23T08:37:19.648876-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): no sid given, looking up: LinuxServerUsers[/CODE]

Then they diverge. A successful login looks like:

[CODE]2018-05-23T09:02:27.658635-04:00 server02 sshd[12842]: pam_winbind(sshd:auth): request wbcLogonUser succeeded
2018-05-23T09:02:27.658967-04:00 server02 sshd[12842]: pam_winbind(sshd:auth): user 'domain\\username' granted access
2018-05-23T09:02:27.659244-04:00 server02 sshd[12842]: pam_winbind(sshd:auth): Returned user was 'DOMAIN\\username'
2018-05-23T09:02:27.659586-04:00 server02 sshd[12842]: pam_winbind(sshd:auth): [pamh: 0x55895157bc20] LEAVE: pam_sm_authenticate returning 0 (PAM_SUCCESS)
2018-05-23T09:02:27.659912-04:00 server02 sshd[12842]: pam_winbind(sshd:account): [pamh: 0x55895157bc20] ENTER: pam_sm_acct_mgmt (flags: 0x0000)
2018-05-23T09:02:27.667852-04:00 server02 sshd[12842]: pam_winbind(sshd:account): user 'DOMAIN\\username' granted access
2018-05-23T09:02:27.668264-04:00 server02 sshd[12842]: pam_winbind(sshd:account): [pamh: 0x55895157bc20] LEAVE: pam_sm_acct_mgmt returning 0 (PAM_SUCCESS)[/CODE]

On one of the affected servers, by contrast, it continues:

[CODE]2018-05-23T08:37:19.726146-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): request wbcLogonUser failed: WBC_ERR_AUTH_ERROR, PAM error: PAM_AUTH_ERR (7), NTSTATUS: NT_STATUS_LOGON_FAILURE, Error message was: Logon failure
2018-05-23T08:37:19.726621-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): user 'domain\\username' denied access (incorrect password or invalid membership)
2018-05-23T08:37:19.726993-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): [pamh: 0x55c99b087a10] LEAVE: pam_sm_authenticate returning 7 (PAM_AUTH_ERR)
2018-05-23T08:37:21.738222-04:00 server01 sshd[20781]: error: PAM: Authentication failure for domain\\\\username from 172.16.xx.xx[/CODE]

I’m using the same user in both cases, which obviously has the same group memberships in the domain. I’m at a loss for where to go from here with troubleshooting. As far as I can tell these attempts aren’t even reaching a domain controller, since my account has not been locked out in AD as it should be after 5 failed login attempts. Is there a way I can tell which domain controller a server is trying to authenticate against?
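The closest I’ve found so far is asking winbind and net directly; I’m not certain these show the DC that the failing PAM request actually used, so treat them as a sketch:

[CODE]# which DC winbind is currently talking to (newer Samba prints the DC name, older just pass/fail)
badserver01:~ # wbinfo --ping-dc
# LDAP server / DC and its reported server time, as resolved by net ads
badserver01:~ # net ads info
# CLDAP lookup of a domain controller for the realm
badserver01:~ # net ads lookup[/CODE]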

Any other suggestions? This one has me really scratching my head. Here are a couple of config files for reference (same on working and non-working servers):

/etc/security/pam_winbind.conf (commented sections removed)

[CODE][global]
	cached_login = no
	krb5_auth = yes
	krb5_ccache_type =
	require_membership_of = LinuxServerUsers
	debug = yes[/CODE]

/etc/samba/smb.conf

[CODE][global]
	workgroup = DOMAIN
	passdb backend = tdbsam
	printing = cups
	printcap name = cups
	printcap cache time = 750
	cups options = raw
	map to guest = Bad User
	include = /etc/samba/dhcp.conf
	logon path = \\\\%L\\profiles\\.msprofile
	logon home = \\\\%L\\%U\\.9xprofile
	logon drive = P:
	usershare allow guests = No
	idmap gid = 10000-20000
	idmap uid = 10000-20000
	kerberos method = secrets and keytab
	realm = DOMAIN.COM
	security = ADS
	template homedir = /home/%D/%U
	template shell = /bin/bash
	#winbind offline logon = yes
	#winbind refresh tickets = yes
[homes]
	comment = Home Directories
	valid users = %S, %D%w%S
	browseable = No
	read only = No
	inherit acls = Yes
[profiles]
	comment = Network Profiles Service
	path = %H
	read only = No
	store dos attributes = Yes
	create mask = 0600
	directory mask = 0700
[users]
	comment = All users
	path = /home
	read only = No
	inherit acls = Yes
	veto files = /aquota.user/groups/shares/
[groups]
	comment = All groups
	path = /home/groups
	read only = No
	inherit acls = Yes
[printers]
	comment = All Printers
	path = /var/tmp
	printable = Yes
	create mask = 0600
	browseable = No
[print$]
	comment = Printer Drivers
	path = /var/lib/samba/drivers
	write list = @ntadmin root
	force group = ntadmin
	create mask = 0664
	directory mask = 0775[/CODE]

I should mention that I tried deleting the server’s computer object in the domain and rerunning the Windows Domain Membership tool. This succeeded and recreated the server object in the domain, but it didn’t fix the auth issue.
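For reference, these are the checks I’d use to confirm the rejoin actually took; they only cover the machine account and keytab, so they don’t necessarily prove anything about the PAM failure:

[CODE]# sanity checks on the machine account after the rejoin
badserver01:~ # net ads testjoin               # verifies the domain join is still valid
badserver01:~ # wbinfo -t                      # checks the machine account trust secret via winbind
badserver01:~ # klist -k /etc/krb5.keytab      # lists host keytab entries (kerberos method = secrets and keytab)[/CODE]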

The following works fine, even on a “bad” server:

[CODE]badserver01:~ # getent passwd domain\\\\username
DOMAIN\\username:*:10000:10000:User Name:/home/DOMAIN/username:/bin/bash
badserver01:~ # getent group domain\\\\LinuxServerUsers
DOMAIN\\linuxserverusers:x:10009:
badserver01:~ # getent initgroups domain\\\\username
domain\\username 10000 10245 10002 10003 10004 10005 10006 10007 10254 10008 10009 10246 10247 10253 10248 10249 10256 10010 10250 10011 10012 10013 10014 10015 10251 10016 10017 10018 10019 10020 10255 10021 10022 10023 10024 10025 10026 10027 10028 10029 10030 10031 10032 10033 10034 10036 10037 10038 10039 10040 10244 10047 10041 10043 10044 10045 10046[/CODE]

/etc/nsswitch.conf:

[CODE]passwd: compat winbind
group: compat winbind

hosts: files dns
networks: files dns

services: files
protocols: files
rpc: files
ethers: files
netmasks: files
netgroup: files nis
publickey: files

bootparams: files
automount: files nis
aliases: files[/CODE]

So it looks like getent uses winbind, which is working just fine.
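Which suggests the failure is specific to the authentication path rather than lookups. The next comparison I plan to make between a good and a bad server is authenticating through winbind directly, outside of sshd/PAM; as I understand it, -a does an NTLM-style logon and -K a Kerberos one (closer to our krb5_auth = yes setup), though I may be off on exactly how much of the pam_winbind code path this shares:

[CODE]# authenticate through winbind directly, bypassing sshd/PAM
# (both prompt for the password if it isn't given as user%password)
badserver01:~ # wbinfo -a 'domain\username'
badserver01:~ # wbinfo -K 'domain\username'[/CODE]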

File this one under “makes no sense.” I found one other correlation between these servers: they’re all on one particular host in our virtual environment. I migrated one to another host and could immediately use AD credentials over SSH again. OK, a network issue, right? Well, after migrating the VM back to the “bad” host, I could still use AD credentials. Huh? After rebooting the VM (on the “bad” host), AD auth breaks again. So the issue only manifests on VMs that were on that particular host at boot time; migrating the VM to any other host makes it “go away” immediately, and it only comes back after the next reboot on the bad host.
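One cheap thing I still want to rule out, given that the trigger is which host the VM booted on and that krb5 login is enabled: clock skew between the guest and the DCs, since guests commonly take their clock from the host at boot. That’s only a hypothesis on my part, but the offset is easy to eyeball:

[CODE]# compare the guest clock with what the DC reports (net ads info includes a server time / offset line)
badserver01:~ # date -u
badserver01:~ # net ads info | grep -i time[/CODE]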

I can’t say this has exactly been solved, but the issue is now probably out of scope here.