Many IBM TM1 and Planning Analytics (PA) systems have long-running processes, process trains, and chores that can accidentally “step” on one another, causing process or chore failures, rollbacks, and consequent problems that can include partial data loads, failed data loads, and dimensional structure problems.
All of these result in stress for administrators, upset users, and a loss of confidence in the system.
There are effective solutions to this state of affairs, ranging from fixing underlying inefficiencies to pausing the execution of chores dynamically.
A Common Problem
It is not uncommon to find TM1 / PA systems that have chores or processes that over the years, or suddenly, have increased in run time due to things such as:
- Increases in data volume, which can be intermittent
- Increases in the complexity of business operations, e.g.
- New subsidiaries
- New methods of work
- Unfortunate system changes made without complete understanding of implications
- Calendar time-dependent changes in run times of processes
- Heavier user activity
- Poor design decisions made long ago
Improper use of TM1, which can include:
- Specific dangerous commands, e.g. SaveDataAll
- Overuse of commands like SecurityRefresh and CubeProcessFeeders
- Improper frequency of chores/chore actions
- Inefficient or old process design, not taking advantage of more modern commands and better practices
- Poor or obsolescent design/implementation decisions
Unfortunately, the practical downstream effect is that chores “step” on one another. That is, a second chore can start before the first is finished. Depending on the process content, this can cause effects like slow processing, process failures, process and user locking, slow response to user requests, process rollbacks, and incorrect results.
All of which can cause users and administrators to be frustrated with poor performance or data issues.
While the details of what happened when things go wrong can be tediously complicated to map out, and vary widely from system to system, most TM1/PA administrators and senior users can look at the logs and run times and quickly divine the likely source of the issues.
Possible Solutions to Chore Interference:
The challenge when process and chores have problems like this is to determine exactly what to do.
Change the chore timing.
In some cases, the problem is remedied by simply giving the chore more time to run, and/or moving the next chore back by enough time to allow the first to complete.
This is normally sufficient in a system where the cause of the problem is known (“we acquired a new subsidiary and have twice the data volume”, “beginning-of-month transaction volume explodes”), and specifically where run times and chore timing are loose enough to allow the subsequent chores to be safely moved.
Fix it at the Source
The ultimate solution, where feasible, is to fix the problem at the data source. If the problem can be traced to a source system responding slowly or providing data late, fix that, if possible.
A typical example is to create a table maintained in the relational database, so that the complex SQL that would otherwise run inside a TM1 process is pushed into the relational DB and TM1 doesn’t have to do complex joins “on the fly”.
System developers sometimes avoid this as a tactic during development as it involves client DB resources that are unavailable at build time, but it often can dramatically speed up process execution.
Another example is a text (aka CSV) data file arriving late on some days due to non-PA/TM1 issues at the sending side.
A third example, of source problems internal to TM1, could be processes that are slow to create data source views (notably where large dimensions are involved), or slow processing due to inefficient rules and feeders, dimension order, or cube structures. Optimizing these “internal to TM1” things can pay big dividends in memory usage, response speed, and user satisfaction.
Fix it at the Source, Part 2: Can you do anything about the data you get?
Chores Part 1
It is important to review chores, as it is common that over the years new processes get added to an existing chore, or existing processes have been asked to do more.
- Do chores need to run all the processes every time?
- Can the chores be split into, for example, a slower, complete overnight chore and a quicker hourly chore? (e.g. load the year once a day, the current month every 3 hours, or the big rebuild “once a day”, and minor things on-demand)?
- Can some chores be split up into smaller processing “chunks” since they are unrelated?
- Can some things be made into “on demand” chores – only?
Sometimes, chore contents are fine but just need to be run sequentially. The solution might be as simple as combining two chores that “need to be sequential” into one chore. That way, whether the nightly chore starting at 00:31 AM takes 30 minutes or 3 hours is of little concern. Other things that run fast can be pushed off to “just before start of business”, or later.
More effective use of process logic inside a chore
A possibility in some systems is to eschew direct chore initiation of processes, and instead use one or more processes that themselves execute “sub-processes”. This allows for sophisticated logic and ordering, for example day-of-week/time logic or other time- or attribute-driven conditional execution, but there are some limitations.
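As a sketch, the prolog of such a “master” process might look like the following TurboIntegrator fragment. The cube, element, and process names are illustrative assumptions, not part of any standard model:

```
# Prolog of a hypothetical "master" process that decides which
# sub-processes to run based on a control-cube schedule flag.
sRunHeavy = CellGetS( 'Sys.Schedule', 'NightlyLoad', 'RunHeavyToday' );

If( sRunHeavy @= 'Y' );
  # Full rebuild only on flagged days
  ExecuteProcess( 'Load.FullRebuild' );
Else;
  # Otherwise just the light incremental load
  ExecuteProcess( 'Load.Incremental', 'pPeriod', 'Current' );
EndIf;
```

The return value of ExecuteProcess can also be compared against ProcessExitNormal() so the master process can react to a sub-process failure.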
Unlike a chore, which can commit after each process it runs, in process-to-process calls the commit happens only on the completion of the outermost calling process.
- Process A Calls
- Process A1 which executes and finishes
- Process A2 executes & calls
- Process A21
- Process A22
- Process A3 executes and finishes
- A21 and A22 can use changes A2 made prior to the call point
- but A22 cannot use A21’s changes until A2 is finished
- nor process A1’s, because the mother process A is not done.
- A1, A2 and A3 are effectively independent, and their changes will not be accessible, or even visible, to each other until the mother process A is complete.
This is not necessarily a problem if the changes pushed through do not have a dependency, but if they do, the resulting errors can be quite subtle and difficult to discern and understand.
(i.e. At the completion of process A it will not be clear what the sub-processes were actually looking at while they were executing.)
Chore Management Part 2: Fix the Processes
It is not unusual when digging into old processes to discover that it may be possible to optimize them based on more current information on the data, or best practices, or even newer TM1 options.
With larger-volume data loads, execute the metadata (dimensional rebuild/maintenance) portion separately, as a process focused on rare dimensional changes, and then have a separate “data load” process whose data section uses modern TM1 “direct” dimension-edit functions for the cases where there happen to be a small number of new dimensional elements. (This is common: a new account is added, for example.) Not having a metadata tab can as much as halve your data load time, since the metadata tab has to execute for each data line.
Note: There is a limitation to the “direct” functions: too many direct dimensional edits in a single dimension can result in very considerable process slowness/overhead, so this tactic should be limited to cases where the dimension is only expected to have a small number of new elements.
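As an illustrative sketch (the cube, dimension, and variable names are assumptions), the data tab of such a load process can handle the occasional new element “directly”, with no metadata tab at all:

```
# Data tab: insert the element only if it does not already exist.
If( DIMIX( 'Account', vAccount ) = 0 );
  # Direct edit: updates the dimension immediately, no metadata pass
  DimensionElementInsertDirect( 'Account', '', vAccount, 'N' );
EndIf;
CellPutN( vAmount, 'GeneralLedger', vAccount, vPeriod );
```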
Sometimes data sources can be split (e.g. by month, entity) and TM1/PA’s newer parallel processing capabilities deployed. Given the right set up, both hardware and TM1 coding, this can massively increase the throughput of large data loads.
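As a hedged sketch of the parallel approach (the process name and parameter are illustrative, and the availability of RunProcess depends on your TM1/PA version), each RunProcess call launches the named process on its own thread and returns immediately:

```
# Fan out one load per month; the twelve loads execute in parallel.
nMonth = 1;
While( nMonth <= 12 );
  RunProcess( 'Load.GL.ByMonth', 'pMonth', NumberToString( nMonth ) );
  nMonth = nMonth + 1;
End;
```

Note that parallel processes writing to the same cube still need care: each runs in its own commit, and lock contention is possible if the data slices overlap.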
Loading less data: some systems were designed with one perspective in mind, and the operational perspective may allow optimization. E.g. instead of loading a year (or an entire cube) every time, a tactic might be to load only the prior and current periods, with older months/periods loaded “on demand” by users if needed (or perhaps scheduled overnight).
Avoid misuse of the SaveDataAll command in TIs: it should be run in its own special-purpose process with very few or no other actions. It should also not be run too often; TM1 transaction logging plus sparing use of the command is a better solution. This command can be very disruptive to other processes. As a rule of thumb, once per chore is enough, and sometimes even that is too much.
Other process fixes possibilities include:
- Data sources that are inefficient (e.g. processing zeros, or presenting too much information, like consolidated data, or unwanted periods that are ignored by the process)
- Using CELLGET/CELLPUT combinations rather than modern CELLINCREMENT
- Excessively complex in-process logic: is it possible to move logic to source data creation (better SQL, better data file creation)? Can disqualifying logic and an ITEMSKIP command be used to shortcut around extraneous data early in a tab?
- Instead of building logic into a process that has to be evaluated for each data point, is it possible to “export” that logic to a precalculated attribute?
- Replacing dimension “cycle through and check” logic with “MDX to static subset” logic in view/subset creation and zero-outs, especially in large dimensions.
- Another common mistake is running ProcessFeeders or SecurityRefresh too often, or at all; like SaveDataAll, they can be quite disruptive. It is better to have a single-purpose process at chore end, if they are needed.
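Two of the bullets above can be sketched in a single data tab (cube, dimension, and variable names are illustrative assumptions):

```
# Skip extraneous records early, before any expensive logic runs
If( vAmount = 0 % SUBST( vAccount, 1, 1 ) @= '#' );
  ItemSkip;
EndIf;

# One CellIncrementN replaces a CellGetN + CellPutN round trip
CellIncrementN( vAmount, 'Sales', vAccount, vPeriod );
```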
Chore Management Part 3: Flexible Run Logic, a.k.a. Run Semaphores
If the chore and process optimization steps have been taken, or are not feasible, there remains a powerful coding solution.
This is sometimes referred to as a “process semaphore” or a process start delay. In short, this is an indication, or “flag”, written by a process in a chore to a location, usually a so-called control cube, that says “I am currently running”; subsequent relevant chores or processes check the flag and, if they encounter the “running” state, either pause or quit.
The design choice of whether the subsequent chore (or process) pauses or quits may depend on the frequency of the subsequent chore and its purpose. E.g. if Chore B runs every hour, and the long-running chore runs once a day at 3 AM, it’s probably not necessary to run the second chore this hour. If the subsequent chore is mission-critical, for example a process that moves the data loaded by Chore A to other cubes in the system, or processes an allocation, that’s a different calculus.
The complexity of run semaphores can vary with the environment. Some are very simple, single-purpose “go/no go” signals; others may vary by chore or time of day.
E.g. if “Chore A” is running, pause Chore C for 5 minutes, but if “Chore B” is running, pause Chore C for 15 minutes.
Still others might keep trying every 5 minutes, 6 times, then write an error to the logs, or elsewhere, explaining the problem for further investigation by administrators or developers.
Coding a Process Run Semaphore
The prime consideration for coding a semaphore is robustness: a developer must take the time to consider all the possible failure modes and whether they are acceptable in the system context. Writing out to logs, logging cubes, or other indicators is often critical for proper future operations.
Processes and chores can, for example, leave 1/0 indications of when they last ran and at what time they started and finished, to the great benefit of system management.
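For example, a process epilog might stamp its finish time into a control cube (the cube and element names here are illustrative):

```
# Record a finish timestamp for this process in a control cube
CellPutS( TIMST( NOW, '\Y-\m-\d \h:\i:\s' ),
          'Sys.ProcessLog', 'Load.GL', 'LastFinished' );
```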
Coding a Semaphore:
- Set the chore to “multiple commit” (almost always a good idea)
- Create a new “Pause/Start” semaphore process
- When it runs it checks a control cube value for status, the “I am running” flag value
- It also, optionally, pulls a “pause” time/count
- If it is allowed to run (no pre-emption), it sets the “I am running” flag, ends the process, moves to next process in the chore.
- If it is preempted, go into a pause loop. (about which, more below)
- When the loop ends recheck flag
- If the loop count exceeds the desired amount (e.g. we have checked 6 times), write an error to the logs and quit the chore (i.e. stop further chore processing). It is generally better to have several checks at a smaller pause interval than one big pause.
- It may be a good idea to write out “tracing” messages during the pause loops, etc.
- Insert the Process in all relevant chores as the first process with appropriate variables
- Create an End Semaphore Process
- It sets the “I am running” flag value to “Not Running”
- If you have multiple different flags, this should be parameter-driven.
- This process is inserted at the end of each relevant chore.
- It is a good idea to write out a log message, or another positive indicator.
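Putting the steps above together, a minimal pair of semaphore processes might look like this sketch. The control cube name, element names, the pChore parameter, and the Sleep function (discussed later in this article) are illustrative assumptions:

```
# --- Start-semaphore process (first process in each relevant chore) ---
# pChore : parameter naming the flag to check, an element of the control cube
sCube     = 'Sys.ChoreControl';
nMaxTries = 6;
nPauseMs  = 300000;   # 5 minutes per check

nTry = 0;
While( CellGetS( sCube, pChore, 'Status' ) @= 'Running' & nTry < nMaxTries );
  LogOutput( 'INFO', 'Chore ' | pChore | ' waiting, attempt ' | NumberToString( nTry + 1 ) );
  Sleep( nPauseMs );
  nTry = nTry + 1;
End;

If( CellGetS( sCube, pChore, 'Status' ) @= 'Running' );
  # Still blocked after all retries: log the problem and stop the chore
  LogOutput( 'ERROR', 'Chore ' | pChore | ' blocked; quitting.' );
  ChoreQuit;
EndIf;

# Clear to run: raise the flag
CellPutS( 'Running', sCube, pChore, 'Status' );

# --- End-semaphore process (last process in each relevant chore) ---
CellPutS( 'Not Running', sCube, pChore, 'Status' );
LogOutput( 'INFO', 'Chore ' | pChore | ' finished normally.' );
```

With the chore in multiple-commit mode, the flag written by the start process is committed, and therefore visible to other chores, as soon as that process completes.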
In addition: it is a good idea to schedule a maintenance chore/process that resets all flags to the “I am not running” state at some point, and at server start-up, in case a process or chore fails and never completes, or the system crashes in an unfinished state and needs to be reset.
Finally: don’t allow your pauses to run longer than the chore interval.
How to Pause a process in TM1
There are two methods to pause a process in TM1.
Officially, the only way is to set up a count loop, as Turbo Integrator does not provide a pause function.
- i.e. Set a value and count it down to zero in a while loop
- Since every server’s processing speed is different, this needs to be calibrated
- Notably, this is fragile, as changes in hardware or processing availability can suddenly change the timing.
- This is also a less than ideal implementation from the system resources point of view.
The better solution is an undocumented, but long-standing (years), command in TI:
- SLEEP(X), where X is a numerical value in milliseconds.
- This has been tested on several systems and seems to be reliable and efficient.
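The two options can be sketched side by side (the count value and pause length are illustrative; the loop count must be calibrated to your server):

```
# Option 1: calibrated count-down loop (fragile; speed varies by server,
# and the loop consumes CPU for the whole "pause")
nCount = 10000000;
While( nCount > 0 );
  nCount = nCount - 1;
End;

# Option 2: the undocumented Sleep function, argument in milliseconds
Sleep( 300000 );   # pause for 5 minutes
```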
Conclusion: Preventing Chore Time Overlap
There are techniques that can be used to prevent process and chore interference in time, as detailed above.
Knowing what to look for, what to fix, and what to live with in some fashion, is a matter of experience.
When a problem of this nature is noted, it should not be ignored. This sort of problem generally does not go away on its own, and tends to recur specifically at times when systems are under other stresses, such as month close or end of budget period, and the data problems that result from interfering processes and chores can be devastating and disruptive to user confidence in systems.
From a practical point of view, fixes for such problems can take a while to diagnose, implement, and test before they are deployed to a production system.
Also, please note that we have not discussed how to handle chores and processes that are user-initiated (on-demand), which can interfere both with other users and with system-driven chores and processes. Many of the same tactics noted in this article can be used, but there are additional considerations.