Friday 1 February 2019

Automation: starting small


Automation in IT is the future, and for some organisations that future is already here.  For many others it is not.  Lots of enterprise-level organisations still work in a traditional way, on traditional enterprise systems (Unix, Oracle, ...).  So, as a simple engineer, how can you guide them into the automation era?

For 17 years now, I have been working on Unix systems, and for the last 7 years also on Linux systems, always in big organisations like government institutions and banks.  Organisations of this magnitude evolve very slowly.  It takes them years just to decide which hardware should replace that old HPUX system, only to replace it with ... a new HPUX system.  I also worked on Solaris, and I remember a company buying 14 extra SUN blades (Solaris) simply because there was still room in the budget.  In the years that followed, we let all new applications land on those extra blades.

How would you get these enterprise environments to take the giant leap to the cloud?  Well, you hope the CIO has seen it in action and wants to move in that direction, but even then it would take 5 to 10 years to change the minds of the decision makers...  As an engineer you are not able to make a difference at that level ... or at least most engineers won't.  I will come back to this in another blog post.

Start Small

So, your enterprise company is using Unix systems, Linux systems, Windows systems and traditional network appliances (e.g. F5).  Can you automate this?  Yes you can!!!

Forget about the big picture for now; it looks too complicated because these systems are not exactly fancy.  But why would that stop you?  Instead, start thinking small.  What small tasks do you do on these systems every day:
- create a volume
- extend a volume
- create a user
- stop a database
- stop a cluster group
- ...

The list can go on and on, but don't let that stop you.  Now that you know what your targets are, go ahead and automate them one by one.  On Linux and Unix, write shell scripts, or if you are a real programmer (and you will be one in the future if you are reading this post) start using Python, Perl, ... On Windows, use PowerShell.  The next time you need one of these tasks, use the script you created.  Stop doing these things manually.  If the script fails, fix it and try again.  After a while, it will be nearly perfect.
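To make this concrete, here is a minimal sketch of what such a task script could look like for "extend a volume" on a Linux system using LVM.  The volume group, logical volume and size are just example parameters; on HPUX or with another volume manager the commands will differ.

    #!/bin/bash
    # extend_volume.sh - extend a logical volume and grow its filesystem
    # Usage:   extend_volume.sh <volume_group> <logical_volume> <extra_size>
    # Example: extend_volume.sh vg00 lv_data 5G
    set -euo pipefail

    VG="$1"      # volume group
    LV="$2"      # logical volume
    SIZE="$3"    # extra space to add, e.g. 5G

    # extend the logical volume and resize the filesystem on it in one step
    lvextend -L "+${SIZE}" -r "/dev/${VG}/${LV}"

    echo "Volume ${VG}/${LV} extended by ${SIZE}"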

Central orchestrator

The next step is to use a centralised orchestrator.  You can use Jenkins (my favorite), but Ansible is also very popular among system engineers.  Whatever orchestrator you use, make sure you can create small entities (like jobs in Jenkins) that each handle one of your small tasks.  All they need to do is trigger your scripts with the correct parameters.

Now that you have linked the orchestrator to your script, start using that instead of the script.  Keep going until it works perfectly.  Never run the script by hand again.
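As a sketch of what that can look like: once the task is wrapped in a parameterised Jenkins job, you can trigger it through the Jenkins REST interface instead of logging in and running the script yourself.  The URL, job name, credentials and parameters below are made up for the example.

    #!/bin/bash
    # Trigger the (hypothetical) extend_volume Jenkins job with parameters,
    # using a Jenkins user and API token instead of running the script by hand.
    JENKINS_URL="https://jenkins.example.com"
    JOB="extend_volume"

    curl -X POST -u "myuser:myapitoken" \
      "${JENKINS_URL}/job/${JOB}/buildWithParameters" \
      --data-urlencode "VG=vg00" \
      --data-urlencode "LV=lv_data" \
      --data-urlencode "SIZE=5G"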

Create flows

Ok, now we have a bunch of tasks, all centralised in an orchestrator tool.  This is where the magic happens.  You can link these tasks together and build flows that handle 5 or 50 steps.  Now you can start looking at the big picture.  Make sure you draw a flow chart first; you will quickly see whether you forgot something.  Once that is ready, implement the flow in the orchestrator and you're done.
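If your orchestrator doesn't give you a graphical flow, even a plain driver script shows the idea: each step in the flow chart reuses one of the small task scripts (or jobs) you already have, and the flow stops at the first step that fails.  The step scripts below are hypothetical.

    #!/bin/bash
    # Sketch of a flow built from the small task scripts created earlier.
    # set -e aborts the whole flow as soon as one step fails.
    set -eu

    steps=(
      "stop_application.sh app01"
      "stop_database.sh DB01"
      "switch_storage.sh site_b"
      "start_database.sh DB01"
      "start_application.sh app01"
    )

    for step in "${steps[@]}"; do
      echo ">>> running step: ${step}"
      ./${step}    # left unquoted on purpose, so the argument is split off
    done

    echo "Flow completed successfully"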

What we have automated in the past is a complete site DRP test, with HPUX and Linux systems running lots of databases and applications.  One push of a button!  Impressive, right ;-).

The same approach can be used for network hardware, or even for tooling that is accessible through REST APIs.  Every action becomes a separate job in your orchestration tool.
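As an example of the REST side, this is roughly what disabling a pool member on an F5 looks like through its iControl REST API; the host, pool, member and credentials are invented, so check the documentation of your own device before using anything like this.

    #!/bin/bash
    # Hypothetical sketch: disable a pool member on an F5 via iControl REST.
    F5_HOST="f5.example.com"
    POOL="pool_webapp"
    MEMBER="10.0.0.15:443"

    curl -sk -u "admin:password" \
      -X PATCH "https://${F5_HOST}/mgmt/tm/ltm/pool/${POOL}/members/${MEMBER}" \
      -H "Content-Type: application/json" \
      -d '{"session": "user-disabled"}'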

Conclusion

So, don't let legacy systems scare you off, give it your best and automate the damn thing ;-).

If you have questions, you can contact me.


Saturday 19 March 2016

Security and developers

I have always wondered: why is security so hard for developers?

I'm a system engineer and I've worked in some highly secured environments for banks.  I've seen my share of badly secured software, not to mention badly written software in general.  Sometimes the problems are in the design; then the problem is not in the code, so whatever the developers did after the design was made is irrelevant.  Sometimes it's just laziness of the developer: why go through the trouble of finding a secure way to handle passwords if you can use them in plain text?  But sometimes the problem is introduced right at the end, while writing the installation manual.  That last one is just sad!

Recently I was asked to install an application that gathers all configuration information from every server, database, ... and sends it to a centralised server.  That server uses the gathered information to do consistency checks.  Great product, and it would save us a lot of problems in the long run.  The communication between the agents and the master is encrypted using SSL.  Great, right?

The installation procedure stated that the private key of the server had to be copied to every agent (about 1000 servers).  I don't understand this way of thinking.  Why is it called "private" if you copy it to every server?  In this setup, if one agent gets compromised, that agent can impersonate the master and harvest the information of the entire server park, which is a big security risk.

Ok, I do understand that you decide not to create 1000 private keys...  But maybe you can make two?  One for the master and one shared by all the agents.  This way, if one agent gets compromised, only bogus info can be sent to the master, and no info can be stolen...  Of course, we deviated from the installation manual to fix this issue...
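In practice the workaround boils down to something like the sketch below: generate two separate key pairs, keep the master key on the master and ship only the agent key (plus the public certificates) to the agents.  The file names and subjects are illustrative.

    #!/bin/bash
    # Generate two separate key pairs with self-signed certificates.
    set -e

    # key pair for the master - master.key never leaves the master server
    openssl req -x509 -newkey rsa:4096 -nodes \
      -keyout master.key -out master.crt -days 3650 -subj "/CN=config-master"

    # key pair shared by the agents - this one is distributed to the agents
    openssl req -x509 -newkey rsa:4096 -nodes \
      -keyout agent.key -out agent.crt -days 3650 -subj "/CN=config-agent"

    # only the certificates (the public halves) cross over:
    # the master trusts agent.crt, the agents trust master.crt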

Why is it so hard to ask an experienced system/security engineer to review the software before distributing it to customers?


Friday 18 March 2016

Automation is the future

After 15 years working in IT as a system engineer, I've seen the evolution from the good old days to off-shoring and near-shoring.

Cost ...

It is the most important word in the IT business.  That is what we are to a business, and the goal is to reduce that cost...  In the good old days, money was no issue.  I worked for a bank where an investment of 2 million euros was made solely based on the discount the supplier was giving...  Did we really need 2 million euros worth of hardware?  Not in the least; the biggest part of that hardware was just collecting dust in the years to come...
In those days, the cost of one FTE was no issue either.  The only problem companies had was finding the right people, and once they were found, the numbers were not important...

One day, the idea grew that there are people everywhere in the world and computers everywhere in the world, so statistically speaking, there must be people everywhere who can use a computer...  From that idea, off-shoring was born.  There were some problems, like the language.  Try to understand some Indian guy in the middle of the night spelling some Flemish URL ... hell!
Another problem was "understanding the customer".  Some HP guy once told me that he got a call from India on a Saturday morning telling him that the website of Mr Ing was down...  ING is one of the biggest international banks in Belgium.  They thought some tennis club website was down or something.  In reality, the whole ING bank was down!
Another important problem is knowledge.  I can understand that Indian guys can also be very smart in computer stuff, but not all of them are...  Companies think they can just pick someone off the streets of India, ask them to spell Linux and, if they can, give them a contract.  So a lot of these guys have no idea what they are doing...
The list of problems just goes on: the time difference, different customs, ...

Ok, the solution to this was near-shoring.  Don't look too far away, maybe some of the problems will disappear if we stay closer to home ... they didn't ...

Now, for the first time in 10 years, I see an evolution in the opposite direction.  I also see opportunities.  Tools like Jenkins, Puppet, ... give us the opportunity to drastically increase our efficiency.  I think we need to focus on showing the IT world that we can have a big advantage over *-shoring, because we have the knowledge and the expertise to create a new way of working...  If my sense of the future is correct, this will save our jobs.

But, as always, time will tell.

Saturday 21 September 2013

What is WIO in Sar reports

When analysing sar data, one of the values that causes a lot of confusion is wio (wait for IO).  It is counted in the CPU reports as part of the CPU time, but is it really used CPU time?

If you look at the picture below, you can see that even KSAR (http://sourceforge.net/projects/ksar/) renders it in red in its graphs:


If you look closely, you can see that the red part is bigger at times when the blue and green parts are smaller.  This is a first indication that the red part is not literally a problem.

When a process has to wait for IO, it is shifted off the CPU.  If the CPU has time to spare, it will wait along with the process for the IO, so the IO can be handled immediately when it completes.  The time the CPU spends waiting for IO in this way is the red part.

However, if the CPU has no time to spare and is busier, it will put the process in a queue to wait.  That way the CPU can serve other processes while the first process waits for its IO.

So, we can say that WIO is significant, but it is not real "busy" time.
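To see these numbers yourself, just ask sar for the CPU report: the column is called %wio on HPUX and Solaris and %iowait on Linux (sysstat).  The awk one-liner below is only an illustration and assumes the default 24-hour Linux output, where %iowait is the sixth column.

    # live CPU report: 3 samples, 5 seconds apart
    sar -u 5 3

    # Linux example: print the timestamp and %iowait column from today's data
    sar -u | awk 'NR > 3 { print $1, $6 }'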

SCHED_NOAGE CPU problems with Oracle log writer on HPUX

One of Oracle's recommendations for running Oracle on HPUX is to use the SCHED_NOAGE CPU scheduler:

http://docs.oracle.com/cd/B28359_01/server.111/b32009/appb_hpux.htm

This replaces the standard HPUX CPU scheduler, which changes the priority of processes during their lifetime.  One of the reasons for this recommendation is to avoid "cursor: pin S" problems when CPU peaks are common:

http://srivenukadiyala.wordpress.com/2012/01/30/sched_noage-and-latch-contention/

By changing the CPU scheduling policy, we introduced another problem.  One of the characteristics of SCHED_NOAGE is that every process keeps the priority it was started with.  By default, Oracle recommends using 178, which is the highest priority you can give to a process in SCHED_NOAGE (the lower the number, the higher the priority).  The lower numbers are reserved for other scheduling policies (like realtime scheduling).

The problem we encountered is that at high loads, the log writer has to wait for other processes to complete.  Processes that are waiting for a commit must wait for the log writer to receive CPU time, and this can cause heavy delays in the applications waiting for those commits.

To solve this, we changed the priority of the log writer process (lgwr_DBNAME) so it would have priority over all the other processes.  

This reduced the commit time tremendously, and application performance got a real boost.

The ideal way to give the log writer more CPU time is to change the default priority of all Oracle processes to, for example, 180.  After the database is started, the priority of the log writer is then changed to 178.
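On HPUX this can be done with rtsched; the sketch below is illustrative (the database name, and therefore the process name, is made up) and should be double-checked against the rtsched man page on your own system.

    #!/bin/bash
    # Find the PID of the log writer for database DBNAME and raise its
    # priority within SCHED_NOAGE (178 = highest SCHED_NOAGE priority).
    LGWR_PID=$(ps -ef | grep '[l]gwr_DBNAME' | awk '{print $2}')

    rtsched -s SCHED_NOAGE -p 178 -P "${LGWR_PID}"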

The only problem is that Oracle support recommended against it, but was unable to supply an alternative.