Database monitoring automation starts with simple steps in Grid Control

Imagine you have hundreds of databases, or more. Each time you add a database to be monitored, you have to take a manual step to apply a monitoring template.

Why not apply a default template automatically each time Grid Control discovers a new database target, and skip that extra step?

Here’s how!
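As a rough sketch of the idea using EM CLI: the verbs and flags below (get_targets, apply_template) and the template name are assumptions that may differ between Grid Control releases, so check the emcli verb reference for your version before scripting anything.

# Hypothetical sketch only -- verify verb names against your release's emcli reference.
emcli login -username=sysman

# List the database targets Grid Control currently knows about
emcli get_targets -targets="%:oracle_database"

# Apply a default monitoring template to a newly discovered database
emcli apply_template -name="DEFAULT_DB_TEMPLATE" -targets="NEWDB:oracle_database"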


Multi-server Grid Control 11g challenge #1 – shared receive/loader directory for cluster

You’d think building a multi-server Grid Control 11g setup is easy… well, definitely not in every case, especially when it comes to choosing a stable shared filesystem.

Just choosing a stable shared/cluster filesystem for the Red Hat Linux OMS nodes, which is a requirement for the multi-node upload (receive/loader) directory? Good luck; there are not many great choices.

The officially supported choices from Oracle are:

  • NFS
  • OCFS2

Other people have also been testing ACFS (ASM Cluster File System) and GFS2 (Global File System 2), but these are not recommended by Oracle for this use (no support).

However, if your environment generates thousands of XML files, handling them can be slow on NFS, and NFS has weaknesses of its own.

OCFS2 – only the 1.4.x series (latest is 1.4.7) is currently supported by Oracle for Red Hat Enterprise Linux. However, 1.4.7 has a fragmentation problem, which shows up especially when lots of files are created and deleted.

There is a workaround, but it is only temporary: it reduces how badly fragmentation builds up compared to the defaults, it does not eliminate it.
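The exact workaround is not spelled out here, so the following is only an illustrative guess at the kind of tuning involved: OCFS2 has a localalloc mount option that sets the size of the local allocation window, and adjusting it is one knob that has been suggested for allocation-heavy workloads. Whether it actually helps is something to verify against the OCFS2 1.4 documentation and your own testing.

# Illustrative /etc/fstab entry; the device, the mount point and the value 16 (MB)
# are arbitrary examples, not recommendations.
/dev/sdb1  /u01/gc_shared  ocfs2  _netdev,localalloc=16  0 0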

OCFS2 has, at least in the past, caused entire-cluster crashes, and users on the mailing lists report such problems even with recent versions.

The fragmentation problem is fixed in OCFS2 1.6, but the 1.6 series is only supported on Oracle Linux with the Unbreakable Enterprise Kernel.

It is interesting to notice that neither GFS2 nor ACFS (the ASM filesystem) is recommended by Oracle here, although both are often marketed as ‘general-purpose filesystems’.

In the end, NFS may be the easiest choice despite its caveats: an NFS problem does not normally take down the entire cluster, and the web service stays responsive for users, which is often what matters most.
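For illustration, a possible /etc/fstab entry for putting the shared receive/loader directory on NFS. The mount options shown are the ones commonly used for Oracle shared storage over NFS and are an assumption here; check the Grid Control documentation and your NFS server vendor's recommendations for the exact options.

# Assumed server name, export and mount point -- adjust for your environment.
nfsserver:/export/gc_recv  /u01/gc_recv  nfs  rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0  0 0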

REFERENCES

The newest OCFS2 1.6 versions with the known bug fixes (including the fragmentation issue) are only available to Oracle Linux subscribers running the Unbreakable Enterprise Kernel. See Oracle Cluster File System (OCFS2) Software Support and Update Policy for Red Hat Enterprise Linux Supported by Red Hat [ID 1253272.1], page 3:

    OCFS2 is not supported for general filesystem use on RHEL without an Oracle Linux support subscription. Oracle provides OCFS2 software and support for database file storage usage for customers who receive Red Hat Enterprise Linux (RHEL) operating system support from Red Hat and have a valid Oracle database support contract. [1]


 

The version available for Red Hat Enterprise Linux, 1.4.7, is known to be problematic, especially regarding fragmentation on filesystems with heavy file creation/deletion activity. Tuning can decrease the occurrence of this problem.

GFS/GFS2 and ACFS are supported by Oracle as general-purpose filesystems

(see Using Red Hat Global File System (GFS) as shared storage for RAC [ID 329530.1]).

 

 

OCFS2 software can be downloaded from the Oracle OCFS2 Downloads for Red Hat Enterprise Linux Server 5 page on OTN. Updates and bug fixes to OCFS2 are provided for Linux kernels that are part of the latest supported release of RHEL5 listed in the Enterprise Linux Supported Releases document and the three update releases prior to that. For example, if the latest supported release of RHEL is RHEL5 Update 5 (RHEL5.5), then updates to OCFS2 will be provided for the RHEL5.2, RHEL5.3, RHEL5.4 and RHEL5.5 kernels.

http://oss.oracle.com/projects/ocfs2/files/RedHat/RHEL5/x86_64/

Supported Filesystems by Oracle / Red Hat

Enterprise Linux: Linux, Filesystem & I/O Type Supportability [ID 279069.1]


1.1.1   Enterprise Linux 5 (EL5)

The following table applies both to Red Hat Enterprise Linux 5 and to Enterprise Linux 5 as provided by Oracle.

Linux Version  RDBMS  I/O Type    File System  Supported
EL5            10g    Async I/O   ext3         Yes
EL5            10g    Async I/O   raw          Depr. (2)
EL5            10g    Async I/O   block        Yes (3)
EL5            10g    Async I/O   ASM (4)      Yes
EL5            10g    Async I/O   OCFS2        Yes
EL5            10g    Async I/O   NFS          Yes
EL5            10g    Async I/O   GFS          Yes (5)
EL5            10g    Async I/O   GFS2         Yes (6)
EL5            10g    Direct I/O  ext3         Yes
EL5            10g    Direct I/O  raw          Depr. (2)
EL5            10g    Direct I/O  block        Yes (3)
EL5            10g    Direct I/O  ASM (4)      Yes
EL5            10g    Direct I/O  OCFS2        Yes
EL5            10g    Direct I/O  NFS          Yes (7)
EL5            11g    Async I/O   ext3         Yes
EL5            11g    Async I/O   raw          Depr. (2)
EL5            11g    Async I/O   block        Yes (3)
EL5            11g    Async I/O   ASM (4)      Yes
EL5            11g    Async I/O   OCFS2        Yes
EL5            11g    Async I/O   NFS          Yes (7)
EL5            11g    Async I/O   GFS          Yes (5)
EL5            11g    Async I/O   GFS2         Yes (6)
EL5            11g    Direct I/O  ext3         Yes
EL5            11g    Direct I/O  raw          Depr. (2)
EL5            11g    Direct I/O  block        Yes (3)
EL5            11g    Direct I/O  ASM (4)      Yes
EL5            11g    Direct I/O  OCFS2        Yes
EL5            11g    Direct I/O  NFS          Yes
  1. See http://oss.oracle.com/projects/ocfs2/ for the current certification of Oracle products with OCFS2
  2. Deprecated but still functional. With Linux kernel 2.6, raw devices are being phased out in favor of O_DIRECT access directly to the block devices. See Note 357492.1
  3. With Oracle RDBMS 10.2.0.2 and higher. See Note 357492.1
  4. Automatic Storage Manager (ASM) and ASMLib kernel driver. Note that ASM is inherently asynchronous and the behaviour depends on the DISK_ASYNCH_IO parameter setting. See also Note 47328.1 and Note 751463.1
  5. See Note 329530.1 and Note 423207.1 for GFS related support
  6. GFS2 is in technical preview in OEL5U2/RHEL5U2 and fully supported since OEL5U3/RHEL5U3.
  7. DIRECTIO is required for database files on NAS (network attached storage)

OCFS2 BUGS and FEATURES

  Write To Ocfs2 Volume gets No Space Error Despite Large Free Space Available [ID 1232702.1]

 

 

  Error “__ocfs2_file_aio_write:2251 EXIT: -5” in Syslog [ID 1129192.1]

Basically this is a known issue which is not a bug in OCFS2 itself; the actual bug lies in the Linux kernel. The cause was addressed in Oracle Bug 8627756 and Red Hat Bug 461100.

   OCFS2: Network service restart / interconnect down causes panic / reboot of other node [ID 394827.1]

 

Cause

This is expected OCFS2 behaviour.

   OCFS2 Kernel Panics on SAN Failover [ID 377616.1]

Changes

Testing cluster resilience to hardware failures.

Cause

The default OCFS2 heartbeat timeout is insufficient for multipath devices.

To check the current active O2CB_HEARTBEAT_THRESHOLD value, do:

 

# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold

The above information is taken from the “Heartbeat” section of

http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html

Solution

To implement the solution, please execute the following steps:

 

   1. Obtain the timeout value of the multipath system. Assume for now it is 60 seconds.

   2. Edit /etc/sysconfig/o2cb and add the parameter O2CB_HEARTBEAT_THRESHOLD = 31 (per the OCFS2 FAQ the threshold is (timeout in seconds / 2) + 1, so a 60-second timeout gives 31; see the sketch after this list).

   3. Repeat on all nodes.

   4. Shut down the o2cb service on all nodes.

   5. Start the o2cb service on all nodes.

   6. Review also Q 61 of http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html and action as required.

   7. Ensure that any proprietary device drivers are at the latest production release level
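A minimal sketch of steps 2-5 on one node, assuming a 60-second multipath timeout. The o2cb and ocfs2 init scripts referenced below ship with ocfs2-tools, but verify the service names and the exact procedure against your OCFS2 release before touching a live cluster.

# Per the OCFS2 FAQ, threshold = (timeout in seconds / 2) + 1, so 60 s gives 31.
# Edit /etc/sysconfig/o2cb on every node so that it contains:
#   O2CB_HEARTBEAT_THRESHOLD=31

# Then, with the OCFS2 filesystems unmounted, restart the cluster stack:
service ocfs2 stop     # unmounts OCFS2 filesystems listed in /etc/fstab (assumed init script)
service o2cb stop
service o2cb start
service ocfs2 start

# Verify the active value afterwards:
cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold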

 

   Discussion from the mailing list

 

   Recommendation – use OCFS2 1.6 IF POSSIBLE TO AVOID FRAGMENTATION

> As a general recommendation, I suggest you to use the Oracle’s
> Unbreakable Linux kernel on top of CentOS 5.4 (or RedHat EL 5.4) and
> OCFS2 1.6, also provided by Oracle, to avoid the known fragmentation
> bug and to be able to enable the indexed directories feature for
> improved performance.

  Problems with OCFS2 reported by users

Bug list

http://oss.oracle.com/bugzilla/buglist.cgi?product=OCFS2&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=NEEDINFO&query_format=advanced&order=bugs.bug_id&query_based_on=

 

   [Ocfs2-users] Anomaly when writing to OCFS2

OK, now for the problem. When an Oracle Interface/Data Load Program is started (this loads data into the db) through Concurrent Manager, we see the log file for this operation (which stores errors etc.) being created in the out directory under $APPLCSF/$APPLOUT, but it is always 0 KB. We know this worked on one of our test systems (using NFS rather than OCFS2), so we know that the process works; we just don’t know why it doesn’t work under OCFS2.

 

 

>> So everything works fine as long as the temp file is being written to.
>> The out file is the issue.
> Not exactly, the temp file is always written to whether the temp file is located on OCFS2 or sym linked to the EXT3 partition. The problem occurs right at the end of the process;

   Server hang with two servers

 

> On 04/08/2011 03:56 AM, Xavier Diumé wrote:
>> Hello,
>> Two servers are using iscsi to mount lun’s from a SAN. The filesystem used is ocfs2 with Debian 6, with a multipath configuration.
>> The two servers are working well together but something hangs one of the servers. Look at the dmesg output:

 

On 04/08/2011 09:25 AM, Mauro Parra wrote:
> Hello,
> I’m getting this error:
> $fsck.ocfs2  /dev/mapper/360a98000572d434e4e6f6335524b396f_part1
> fsck.ocfs2: File not found by ocfs2_lookup while locking down the cluster
> Any idea or hint?

This means it failed to lookup the journal in the system directory.

 

   [Ocfs2-users] OCFS2 crash

Fri Apr 15 01:12:51 PDT 2011

We’re running ocfs2-2.6.18-92.el5-1.4.7-1.el5, on a 12 node Red Hat 5.2 64bit cluster. We’ve now experienced the following crash twice:

   servers blocked on ocfs2

 

Last modified: 2011-01-06 23:28 oss.oracle.com Bugzilla Bug 1301

                             

It is not necessary to say we are quite worried about this. We are seriously near to give up on OCFS2 because the problem is affecting too much our services. If you do not see a close solution to this, please tell us.

OCFS2 run slower and slower after 30 days of usage

oss.oracle.com Bugzilla Bug 1281             Last modified: 2010-12-01 23:42

(In reply to comment #86)
> I’m sorry we haven’t gotten to this sooner. I’m not sure we can relax
> j_trans_barrier. It is a good question why journal action blocks us so much.

We are in the process of deciding what to do with or without OCFS.

One question: if we change the behaviour of OCFS usage by storing only much less frequently changed files/folders on OCFS, like storing only the code, do you think the same problem will arise in this mode?

   Bonnie++ file system benchmark tool crashes OCFS2

 

Last modified: 2010-05-05 11:25

I have a testcluster consisting of 2 nodes with a shared storage below them. The ocfs2 filesystem was mounted (module version 1.50, ocfs2-tools version 1.4.3, Linux Debian system, 32 bit) and on both I ran the benchmarking tool ‘bonnie++’. After a few minutes both systems crashed and rebooted.

The kernel message I got was:

thegate:/mnt/1a# bonnie++ -s 4000 -u root -r 2000
Using uid:0, gid:0.
Writing a byte at a time…
Message from syslogd@thegateap at May  5 16:21:12 …
 kernel: [56806.205914] Oops: 0000 [#1] SMP

 

Locking issues with shared logfiles

 

Description:        Opened: 2009-04-20 22:10

I have a cluster with 5 nodes hosting web application. All web servers save log info into shared access.log file. There is awstats log analyzer on the first node. Sometimes this node fails with the following messages (captured on another server)

 

Two nodes doing parallel umount hang cluster

Opened: 2011-02-04 17:00  oss.oracle.com Bugzilla Bug 1312

Two nodes doing parallel umount hang. dlm locking_state shows the same lockres on both nodes. The lockres has no locks.

 

NULL pointer dereference in Linux 2.6.33.3 : ocfs2_read_blocks when peer hang

oss.oracle.com Bugzilla Bug 1286             Last modified: 2010-08-19 03:14

After node 2 was shut down electrically with ocfs2 mounted (test scenario), the first node crashes … We also often have to fsck -f the block device, otherwise mount.ocfs2 crashes when attempting to mount the device. It doesn’t work with only fsck (without -f).

 


 



 




Not Getting Alerts for ORA Errors for 11g Databases and Cannot View the alert.log Contents for 11g Databases From Grid Control 10.2.0.5 [ID 947773.1]

Another nice feature when the 10.2.0.5 agent monitors 11g databases.

 

Watch out!

1) Configure the “Incident” and “Incident Status” metrics to monitor for ORA- errors in the alert log (log.xml) on 11g databases.

2) After the EM agent installation, manually set up a symbolic link to ojdl.jar in $ORACLE_HOME/sysman/jlib pointing to $ORACLE_HOME/diagnostics/lib, where $ORACLE_HOME is the EM agent Oracle home. For example, on Linux and Unix platforms:

cd $ORACLE_HOME/sysman/jlib
ln -s ../../diagnostics/lib/ojdl.jar

On platforms where symbolic links are not supported, manually copy ojdl.jar from $ORACLE_HOME/diagnostics/lib into $ORACLE_HOME/sysman/jlib after the EM agent installation.

Also ensure that the copy of ojdl.jar in $ORACLE_HOME/sysman/jlib is updated whenever the ojdl.jar in $ORACLE_HOME/diagnostics/lib is patched in the future.
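A quick way to sanity-check the link afterwards; plain shell, nothing assumed beyond the paths quoted in the note above:

# Run as the agent software owner; $ORACLE_HOME is the EM agent home
ls -l $ORACLE_HOME/sysman/jlib/ojdl.jar        # should resolve to ../../diagnostics/lib/ojdl.jar
ls -l $ORACLE_HOME/diagnostics/lib/ojdl.jar    # the real file the link must follow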