Cassandra
Installation:
Go to http://cassandra.apache.org/download/ and download the latest version of Cassandra, or run "git clone git://git.apache.org/cassandra.git" to fetch the latest source and build it.
Open the following firewall ports for the security group hosting the Cassandra cluster nodes:
Cassandra Port    Port number
Gossip port       7000
JMX port          8080
Thrift port       9160
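Once the ports are open, reachability can be checked from another machine. The following is a minimal Python sketch and not part of Cassandra; the function name check_ports is hypothetical, and the port numbers are simply the ones listed above.

```python
import socket

# Port numbers taken from the list above.
CASSANDRA_PORTS = {"gossip": 7000, "jmx": 8080, "thrift": 9160}


def check_ports(host, ports=CASSANDRA_PORTS, timeout=2.0):
    """Return {name: bool} telling whether a TCP connect to each port succeeded."""
    results = {}
    for name, port in ports.items():
        try:
            # create_connection raises OSError on refusal, timeout or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:
            results[name] = False
    return results
```

For example, check_ports("<Host1>") returns a dictionary with one True or False entry per port. A False entry means only that the TCP connection failed; the cause may be the firewall or simply a node that is not running yet.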
Directories for Cassandra:
Create the Cassandra default data directory, cache directory and commit log directory
sudo mkdir -p /var/lib/cassandra/data
sudo mkdir -p /var/lib/cassandra/commitlog
sudo mkdir -p /var/lib/cassandra/saved_caches
sudo chown -R
Create Cassandra logging directory
sudo mkdir -p /var/log/cassandra
sudo chown -R <User> /var/log/cassandra
Configuring Cassandra:
vi conf/cassandra.yaml
Change "cluster_name" to the name of the cluster
1. Change the listening addresses for Cassandra (listen_address) and Thrift (rpc_address)
2. Install JNA for Cassandra: Java Native Access on Linux improves Cassandra memory usage and performance
Download jna.jar from http://java.net/projects/jna/downloads/directory
Add jna.jar to $CASSANDRA_HOME/lib
vi /etc/security/limits.conf
$USER soft memlock unlimited
$USER hard memlock unlimited
3. Repeat the above steps for every node in the ring/cluster
4. Configure the Cassandra seeds
Select a subset of the ring nodes as seeds. Non-seed nodes contact the seed nodes to join the ring
Define at least one seed, but preferably more, for fault tolerance
Seeds are contacted only when a node joins the ring; no other communication with seeds is necessary afterwards
All nodes should have the same seed list
On each node, edit cassandra.yaml to add the Cassandra cluster seeds
seeds:
- <Host1>
- <Host2>
5. Change the Cassandra "initial_token" value
The initial_token value for each node is "i*(2**127)/number_of_nodes", where i runs from 0 to (number_of_nodes-1)
6. For non-seed nodes only, perform data migration automatically with Cassandra bootstrap
auto_bootstrap: true
a. auto_bootstrap automatically migrates a range of data to the new node
b. To add a new seed, start the node as a non-seed node with auto_bootstrap enabled to migrate the data first. Then turn auto_bootstrap off and make it a seed node
7. If you plan to use range queries, you have to choose an order-preserving partitioner.
The partitioner is responsible for distributing rows (by key) across nodes in the cluster. Any IPartitioner may be used, including your own, as long as
it is on the classpath. Out of the box, Cassandra provides org.apache.cassandra.dht.RandomPartitioner, org.apache.cassandra.dht.ByteOrderedPartitioner,
org.apache.cassandra.dht.OrderPreservingPartitioner (deprecated), and org.apache.cassandra.dht.CollatingOrderPreservingPartitioner (deprecated).
- RandomPartitioner distributes rows across the cluster evenly by md5. When in doubt, this is the best option.
- ByteOrderedPartitioner orders rows lexically by key bytes. BOP allows scanning rows in key order, but the ordering can generate hot spots for
sequential insertion workloads.
- OrderPreservingPartitioner is an obsolete form of BOP, that stores keys in a less-efficient format and only works with keys that are UTF8-encoded Strings.
- CollatingOPP collates according to EN,US rules rather than lexical byte ordering. Use this as an example if you need custom collation.
See http://wiki.apache.org/cassandra/Operations for more on partitioners and token selection.
partitioner: org.apache.cassandra.dht.OrderPreservingPartitioner
8. The default directories for logging can also be modified. Refer to cassandra.yaml, which explains each parameter in detail
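The initial_token formula from step 5 is easy to get wrong by hand because the numbers are large. The following is a minimal Python sketch of that formula, assuming the 0 to 2**127 token range that the formula implies (the range used by RandomPartitioner); the function name initial_tokens is hypothetical.

```python
def initial_tokens(number_of_nodes):
    """Return i*(2**127)/number_of_nodes for i = 0 .. number_of_nodes-1."""
    if number_of_nodes < 1:
        raise ValueError("number_of_nodes must be at least 1")
    # Python integers are arbitrary precision, so 2**127 is computed exactly;
    # // keeps the result an integer.
    return [i * (2 ** 127) // number_of_nodes for i in range(number_of_nodes)]


# Print one cassandra.yaml line per node for a four-node ring.
for i, token in enumerate(initial_tokens(4)):
    print(f"node {i}: initial_token: {token}")
```

Each printed value goes into the cassandra.yaml of the corresponding node.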
Starting Cassandra
Cassandra Options are configured in bin/cassandra.in.sh
Cassandra environment options are configured in conf/cassandra-env.sh
For a production system:
make a copy of cassandra.in.sh as prod.in.sh
make changes to the copy
start Cassandra with CASSANDRA_INCLUDE=/path/to/prod.in.sh bin/cassandra
To start Cassandra as a non-daemon (foreground) process, use the "-f" option:
bin/cassandra -f
To kill Cassandra with a script:
Record the process id in a file: cassandra -p /var/run/cass.pid
Kill the process: kill $(cat /var/run/cass.pid)
The Cassandra Log4J logging configuration file is located in "conf/log4j-server.properties"
To monitor the Cassandra log files
tail -f /var/log/cassandra/output.log
tail -f /var/log/cassandra/system.log
Adding a new node to Cassandra:
1. Install Cassandra on all hosts where you plan to run nodes
2. Set auto_bootstrap to false for the initial node, i.e. "auto_bootstrap: false". For all non-seed nodes keep "auto_bootstrap: true" in cassandra.yaml
3. For the other new nodes, calculate the initial_token value using the formula "i*(2**127)/number_of_nodes", where i runs from 0 to (number_of_nodes - 1). For the initial node, set the initial_token to zero, i.e. "initial_token: 0"
4. Start the new nodes sequentially. Use nodetool ("nodetool netstats") to confirm that startup has completed before starting the next one.
5. For each existing node, run nodetool to change the token . " nodetool move
6. For each existing node, run nodetool to remove data that has been migrated to other nodes: "nodetool cleanup"
Cassandra Data Backup & Recovery
1. Single node snapshot: nodetool -h
2. cluster snapshot: clustertool -h
Snapshot data will be stored in /var/lib/cassandra/data/mykeyspace/snapshots/timestamp-snapshotname/*.db
3. To delete all snapshots of the Cassandra node use " nodetool -h
4. To delete all Cassandra snapshots in a cluster : nodetool -h
Incremental Cassandra Backup
1. To enable incremental backup, set "incremental_backups: true" in cassandra.yaml
When incremental backup is enabled (the default is off), Cassandra persists each flushed SSTable to a backup directory under
/var/lib/cassandra/data/mykeyspace/backups/
Old incremental backup files need to be removed manually. Consider removing them after taking a snapshot
With these incremental backup files in conjunction with a snapshot, an administrator can restore data in a node when data corruption occurs
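Since old incremental backup files are not removed automatically, the cleanup can be scripted. The following is a minimal Python sketch; the function name prune_backups and the idea of using a modification-time cutoff (for example, the time of the most recent snapshot) are assumptions, not something Cassandra provides.

```python
from pathlib import Path


def prune_backups(backups_dir, cutoff_epoch_seconds):
    """Delete files in backups_dir last modified before the cutoff.

    Returns the sorted names of the files that were removed.
    """
    removed = []
    for entry in Path(backups_dir).iterdir():
        # Only plain files are considered; subdirectories are left alone.
        if entry.is_file() and entry.stat().st_mtime < cutoff_epoch_seconds:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)
```

A call such as prune_backups("/var/lib/cassandra/data/mykeyspace/backups", snapshot_time) would be repeated for every keyspace, where snapshot_time is a hypothetical variable holding the snapshot time in seconds since the epoch.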
Restore Cassandra from Backups
1. Shut down the node to be restored
2. Clear the commit log by removing the files under the commitlog folder
rm /var/lib/cassandra/commitlog/*
3. For every keyspace, remove the db files
rm /var/lib/cassandra/data/mykeyspace/*.db
Do not remove the snapshots directory inside it
4. Locate the latest snapshot directory
/var/lib/cassandra/data/mykeyspace/snapshots/timestamp-thissnapshotname
5. Copy the snapshot to the data directory
cp -p /var/lib/cassandra/data/mykeyspace/snapshots/1304617358646-mylatestsnapshot/* /var/lib/cassandra/data/mykeyspace
6. Copy the incremental backups to the data directory
cp -p /var/lib/cassandra/data/mykeyspace/backups/* /var/lib/cassandra/data/mykeyspace
7. Repeat the above steps for other keyspaces
8. Restart the node
The restart can be CPU and I/O intensive because of the data compaction during the restoration
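The file operations in steps 2 to 6 can be sketched in Python as follows. This is an illustration under stated assumptions, not a Cassandra tool: the function names are hypothetical, snapshot directory names are assumed to start with a numeric timestamp as in the paths above, and the node is assumed to be shut down already.

```python
import shutil
from pathlib import Path


def clear_commitlog(commitlog_dir):
    """Step 2: delete the files in the commit log directory; return how many were removed."""
    removed = 0
    for entry in Path(commitlog_dir).iterdir():
        if entry.is_file():
            entry.unlink()
            removed += 1
    return removed


def latest_snapshot(keyspace_dir):
    """Step 4: return the snapshot directory with the highest timestamp prefix, or None."""
    snapshots_dir = Path(keyspace_dir) / "snapshots"
    if not snapshots_dir.is_dir():
        return None
    candidates = [p for p in snapshots_dir.iterdir() if p.is_dir()]
    if not candidates:
        return None
    # Assumes names of the form "<timestamp>-<snapshotname>".
    return max(candidates, key=lambda p: int(p.name.split("-", 1)[0]))


def _copy_files(src_dir, dst_dir):
    """Copy regular files; copy2 also copies timestamps and permission bits, roughly like cp -p."""
    names = []
    for src in Path(src_dir).iterdir():
        if src.is_file():
            shutil.copy2(src, Path(dst_dir) / src.name)
            names.append(src.name)
    return names


def restore_keyspace(keyspace_dir):
    """Steps 3 to 6 for one keyspace; returns the sorted names of the files copied in."""
    keyspace_dir = Path(keyspace_dir)
    snapshot = latest_snapshot(keyspace_dir)
    if snapshot is None:
        # Checked before deleting anything, so a missing snapshot leaves the data intact.
        raise FileNotFoundError(f"no snapshot found under {keyspace_dir}")
    # Step 3: glob("*.db") is not recursive, so snapshots/ and backups/ are untouched.
    for db_file in keyspace_dir.glob("*.db"):
        db_file.unlink()
    copied = _copy_files(snapshot, keyspace_dir)        # step 5
    backups = keyspace_dir / "backups"
    if backups.is_dir():
        copied += _copy_files(backups, keyspace_dir)    # step 6
    return sorted(copied)
```

Calling clear_commitlog once and restore_keyspace once per keyspace (step 7) covers the procedure up to the restart. The snapshot is located before any .db file is deleted, which is a deliberate reordering of steps 3 and 4 so that a missing snapshot cannot leave the keyspace empty.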
Cassandra nodetool repair
Run Cassandra nodetool repair as follows:
1. When data loss or a failure is suspected, use nodetool repair to repair Cassandra data from its replicas
2. Run nodetool repair periodically on all nodes in the cluster, at least once every GCGraceSeconds (default 10 days), so that deletes reach every replica before their tombstones are garbage-collected and deleted rows do not reappear
3. This operation is CPU and disk intensive. Run it sequentially, one node at a time
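Running the repair one node at a time can be scripted. The following is a minimal Python sketch that assumes nodetool is on the PATH and that each node is addressed with "nodetool -h <host> repair"; the function names and the injectable run parameter are hypothetical conveniences.

```python
import subprocess


def repair_command(host):
    """Build the nodetool invocation that repairs one node."""
    return ["nodetool", "-h", host, "repair"]


def repair_ring(hosts, run=subprocess.run):
    """Repair each host in turn and return the hosts that completed.

    `run` is injectable so the loop can be exercised without a cluster.
    """
    repaired = []
    for host in hosts:
        # subprocess.run blocks until nodetool exits, which keeps the repairs
        # sequential; check=True raises CalledProcessError on a non-zero exit,
        # so the next node is not started after a failed repair.
        run(repair_command(host), check=True)
        repaired.append(host)
    return repaired
```

Scheduling a call such as repair_ring(["<Host1>", "<Host2>"]) more often than GCGraceSeconds, for example weekly, would satisfy item 2.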
Cassandra Authentication & Authorization
1. Edit cassandra.yaml to enable authentication: "authenticator: org.apache.cassandra.auth.SimpleAuthenticator"
2. Edit access.properties to set read and write privileges for individual users: "MyKeySpace[.MyColumnFamily].PERMISSION=MyUsers"
Ex:
- Right to modify list of keyspaces
=santosh
- Access to Keyspace1
Keyspace1.=santosh,mahesh
Keyspace1.=mahesh
- Access to Column Family Standard1 of Keyspace1
Keyspace1.Standard1.=santosh,mahesh,abc
*********************************************************
where
ro: read only
rw: read and write privilege
3. Edit the password file passwd.properties to set each user's password
santosh=hanuman
4. To store passwords as MD5 hashes instead of plain text, open conf/cassandra-env.sh and add or change: JVM_OPTS="$JVM_OPTS -Dpasswd.mode=MD5"
5. Restart Cassandra with the access and password files
bin/cassandra -f -Dpasswd.properties=conf/passwd.properties -Daccess.properties=conf/access.properties
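With passwd.mode=MD5 set in step 4, the entries in passwd.properties need to hold hashes rather than plain-text passwords. The following is a minimal Python sketch that assumes SimpleAuthenticator expects the hexadecimal MD5 digest of each password and that passwords are UTF-8 encoded; both points should be checked against the authenticator shipped with your Cassandra version. The function name md5_password_entry is hypothetical.

```python
import hashlib


def md5_password_entry(user, password):
    """Return a 'user=<md5 hex digest>' line for passwd.properties."""
    # Assumption: the authenticator compares against the hex-encoded MD5 digest.
    digest = hashlib.md5(password.encode("utf-8")).hexdigest()
    return f"{user}={digest}"


# Entry for the example user from step 3.
print(md5_password_entry("santosh", "hanuman"))
```

The printed line replaces the plain-text entry for that user in passwd.properties.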