Tweak jepsen partition tests to have better output

Janne Valkealahti
2015-08-23 19:16:51 +01:00
parent 5879332f9f
commit 8dff5eec4d
6 changed files with 142 additions and 28 deletions


@@ -537,31 +537,79 @@ What's happening in above chart:
==== Partition Tolerance
We need to always assume that sooner or later things in a cluster will
go bad whether that is just a crash of a `Zookeeper` or a state
go bad whether it is just a crash of a `Zookeeper` instance, a state
machine or a network problem like a `brain split`. Brain split is a
situation where existing cluster members are isolated so that only
part of the hosts are able to see each other. The usual scenario is that a
brain split will create a minority and majority of an ensemble where
hosts in a minority cannot participate in an ensemble anymore until
network status has been healed.
brain split will create minority and majority partitions of an
ensemble, where hosts in the minority cannot participate in the ensemble
until the network has been healed.
In this test we will demonstrate that various types of brain splits in
an ensemble will eventually lead to a fully synchronized state of all
In the tests below we will demonstrate that various types of brain splits in
an ensemble will eventually lead to a fully synchronized state of all
distributed state machines.
image::images/sm-tech-partition-half.png[width=500]
There are two scenarios for a single straight brain split in a
network where `Zookeeper` and `Statemachine` instances are
split in half, assuming each `Statemachine` connects to a
local `Zookeeper` instance:
* If the current zookeeper leader is kept in the majority, all clients
connected to the majority will keep functioning properly.
* If the current zookeeper leader is left in the minority, all clients will
disconnect from it and will try to reconnect until the previous
minority members have successfully rejoined the existing majority
ensemble.
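The two scenarios above can be sketched as a small Clojure function. This is only an illustration under simplifying assumptions: the names `majority?` and `partition-outcome` are hypothetical and not part of the test suite, the split is assumed to produce exactly two sides, and the behaviour of a real `Zookeeper` ensemble is reduced to the client-visible outcome described in the list.

```clojure
(defn majority?
  "True when `side` holds a strict majority of an ensemble
  with `ensemble-size` members."
  [side ensemble-size]
  (> (count side) (quot ensemble-size 2)))

(defn partition-outcome
  "Hypothetical helper: returns the set of nodes expected to keep
  functioning while the ensemble is split into `side-a` and `side-b`
  (both sets of node names) and `leader` is the current leader."
  [side-a side-b leader]
  (let [size     (+ (count side-a) (count side-b))
        ;; at most one side can hold a strict majority
        majority (first (filter #(majority? % size) [side-a side-b]))]
    (if (contains? majority leader)
      majority   ; leader kept in the majority: that side keeps working
      #{})))     ; leader left in the minority: every client disconnects

;; Leader n1 stays in the majority: n1, n2 and n5 keep functioning.
(partition-outcome #{"n1" "n2" "n5"} #{"n3" "n4"} "n1")

;; Leader n3 is left in the minority: the result is the empty set.
(partition-outcome #{"n1" "n2" "n5"} #{"n3" "n4"} "n3")
```

The sketch deliberately ignores timing: in the second scenario the nodes recover only after the network has been healed.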
[NOTE]
====
In our current `jepsen` tests we can't control whether a zookeeper brain
split leaves the leader in the majority or the minority, so we need to
run the tests multiple times to hit each scenario.
====
[NOTE]
====
In the plots below we have mapped a state machine error condition to an
`error` state to indicate that the `state machine` is in an error state instead
of a normal state. Please keep this in mind when interpreting chart states.
====
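The mapping described in the note can be sketched as follows. The `statetovalue` map mirrors the one used by the checker in these tests. The helper `plot-value` is a hypothetical name introduced for illustration, and it assumes a node's states arrive as a vector of state ids whose last element is the one to plot.

```clojure
;; Each state id gets a slot on the y-axis; "error" takes the top slot.
(def statetovalue {"S11" 1 "S12" 2 "S211" 3 "S212" 4 "error" 5})

(defn plot-value
  "Hypothetical helper: maps a vector of state ids to a y-axis value
  using its last element. Returns nil for unknown states."
  [states]
  (get statetovalue (last states)))

(plot-value ["S0" "S2" "S21" "S211"]) ;; => 3
(plot-value ["error"])                ;; => 5
```

A node that reports a state machine error is therefore drawn on the `error` line of the chart rather than on one of the regular state lines.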
In this first test we show that when the existing zookeeper leader is
kept in the majority, 3 out of 5 machines continue as is.
image::images/sm-tech-partition-half-1.png[width=500]
What's happening in above chart:
* First, event `C` is sent to all machines, leading to a state change to
`S211`.
* The Jepsen nemesis causes a brain split, which creates the partitions
`n1/n2/n5` and `n3/n4`. Nodes `n3/n4` are left in the minority and
nodes `n1/n2/n5` constructs a new healthy majority. Nodes in
nodes `n1/n2/n5` construct a new healthy majority. Nodes in
majority will keep functioning without problems but nodes in minority
will get into an error state.
* Jepsen heals the network and after some time nodes `n3/n4` join
back into the ensemble and synchronize their distributed status.
* Lastly, event `K1` is sent to all state machines to ensure that the ensemble
is working properly. This state change leads back to state
`S21`.
In this second test we show that when the existing zookeeper leader is
left in the minority, all machines will error out:
image::images/sm-tech-partition-half-2.png[width=500]
What's happening in above chart:
* First, event `C` is sent to all machines, leading to a state change to
`S211`.
* The Jepsen nemesis causes a brain split, which partitions the network
so that the existing `Zookeeper` leader is left in the minority and all
instances are disconnected from the ensemble.
* Jepsen heals the network and after some time all nodes join
back into the ensemble and synchronize their distributed status.
* Lastly, event `K1` is sent to all state machines to ensure that the ensemble
is working properly. This state change leads back to state
`S21`.
==== Crash and Join Tolerance
In this test we will demonstrate that killing existing state machine

Binary file not shown (after: 33 KiB).

Binary file not shown (after: 34 KiB).

Binary file not shown (before: 31 KiB).


@@ -13,7 +13,7 @@
[gnuplot.core :as g]))
(def nodetovalue {"n1" 1 "n2" 2 "n3" 3 "n4" 4 "n5" 5 })
(def statetovalue {"S11" 1 "S12" 2 "S211" 3 "S212" 4 })
(def statetovalue {"S11" 1 "S12" 2 "S211" 3 "S212" 4 "error" 5 })
(def variabletovalue {"v1" 1 "v2" 2 "v3" 3 "v4" 4 "v5" 5 "v6" 6 "v7" 7 "v8" 8})
(defn shiftvalue
@@ -76,6 +76,20 @@
(vector
(reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (get nodetovalue (name (last (keys (get value :value))))) "X")) ) [] (vec data)))))
(defn extract-plot-data6
[history]
(let [data
(->> history
(filter #(= :ok (:type %)))
(filter #(= :statesnoexpect (:f %)))
(group-by :process))]
(vector
(reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 0))
(reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 1))
(reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 2))
(reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 3))
(reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 4)))))
(defn plot1!
[test model history]
@@ -172,6 +186,38 @@
output-path)
{:valid? true})
(defn plot4!
[test model history]
(let [output-path (.getCanonicalPath (store/path! test "states.png"))]
(g/raw-plot! [[:set :key :outside]
[:set :style :textbox :opaque]
[:set :terminal :qt :size (keyword "900,450")]
[:set :yrange (keyword "[0.5:5.5]")]
[:set :y2range (keyword "[0.5:5.5]")]
[:set :xtics :format "%h\nns"]
[:set :xlabel "elapsed time"]
[:set :ylabel "states in nodes"]
[:set :y2label "events via nodes"]
[:set :ytics 1]
[:set :ytics (keyword "('S21' 1, 'S22' 2, 'S211' 3, 'S212' 4, 'error' 5)")]
[:set :ytics :nomirror]
[:set :y2tics 1]
[:set :y2tics (keyword "('n1' 1, 'n2' 2, 'n3' 3, 'n4' 4, 'n5' 5)")]
[:plot
(g/list ["-" :title "states n1" :with :steps :lw :3]
["-" :title "states n2" :with :steps :lw :3]
["-" :title "states n3" :with :steps :lw :3]
["-" :title "states n4" :with :steps :lw :3]
["-" :title "states n5" :with :steps :lw :3]
["-" :title "events" :with :labels :center :boxed :font ",15" :axis :x1y2]
)]]
(into
(extract-plot-data6 history)
(extract-plot-data2 history)))
output-path)
{:valid? true})
(defn checker1
"Constructs a Jepsen checker."
[]
@@ -192,3 +238,10 @@
(reify Checker
(check [_ test model history]
(if (env :plot) (plot3! test model history) {:valid? true}))))
(defn checker4
"Constructs a Jepsen checker."
[]
(reify Checker
(check [_ test model history]
(if (env :plot) (plot4! test model history) {:valid? true}))))


@@ -18,6 +18,7 @@
[spring-statemachine-jepsen.checker :refer [checker1]]
[spring-statemachine-jepsen.checker :refer [checker2]]
[spring-statemachine-jepsen.checker :refer [checker3]]
[spring-statemachine-jepsen.checker :refer [checker4]]
[jepsen.checker.timeline :as timeline]
[jepsen.control.net :as net]
[jepsen.os.debian :as debian]
@@ -41,18 +42,20 @@
(http/post (str "http://" (name node) ":8080/event")
{:form-params {:id (str event) :testVariable value}}))
(defn sm-read-states
"Reading states from a state machine"
[node]
(let [response (http/get (str "http://" (name node) ":8080/states") {:as :json})]
(get response :body)))
(defn sm-read-status-ok?
"Read status and check that there is no errors"
[node]
(let [response (http/get (str "http://" (name node) ":8080/status") {:as :json})]
(= (get (get response :body) :hasStateMachineError) false)))
(defn sm-read-states
"Reading states from a state machine"
[node]
(if (sm-read-status-ok? node)
(let [response (http/get (str "http://" (name node) ":8080/states") {:as :json})]
(get response :body))
(vec ["error"])))
(defn sm-read-state-variable
"Read status and check that there is no errors"
[node key]
@@ -82,8 +85,7 @@
(Thread/sleep 1000)
(if (sm-read-status-ok? node) false true)
(catch Exception e true))
(recur))))
)
(recur)))))
(defn start!
[node]
@@ -155,6 +157,11 @@
)
(catch RuntimeException e
(assoc op :type :fail :value (.getMessage e))))
:statesnoexpect (try
(let [res (sm-read-states client)]
(assoc op :type :ok :value (vec res)))
(catch RuntimeException e
(assoc op :type :fail :value (.getMessage e))))
:event (try
(sm-send-event client (:e op))
(assoc op :type :ok :value (:e op))
@@ -187,6 +194,17 @@
:f :states
:s expect}]))))))
(defn gen-read-states-noexpect
"Read states n times."
[times]
(gen/clients
(gen/each
(gen/seq
(take (* times 2)
(cycle [(gen/sleep 1)
{:type :invoke
:f :statesnoexpect}]))))))
(defn gen-send-event
"Send event one time to random node."
[event]
@@ -307,24 +325,19 @@
"Generates event and checks states while splitting network"
[]
(gen/phases
(gen-read-states 5 ["S0","S1","S11"])
(gen-read-states-noexpect 10)
(gen-send-event-all "C")
(gen-read-states 5 ["S0","S2","S21","S211"])
(gen-status 2)
(gen-read-states-noexpect 10)
;start nemesis, split network
(gen/nemesis
(gen/once {:type :info :f :start}))
(gen-read-states 15 ["S0","S2","S21","S211"])
(gen-status 5)
(gen-read-states-noexpect 15)
;stop nemesis, heal network
(gen/nemesis
(gen/once {:type :info :f :stop}))
(gen-status 5)
(gen-read-states 15 ["S0","S2","S21","S211"])
(gen-read-states-noexpect 100)
(gen-send-event-all "K")
(gen-read-states 10 ["S0","S1","S11"])
(gen-status 30)
(gen-read-states 10 ["S0","S1","S11"])))
(gen-read-states-noexpect 10)))
(defn event-gen-5
"Generates starts and stops and checks joins"
@@ -431,7 +444,7 @@
(event-test "partition-half"
{:nemesis (nemesis/partition-random-halves)
:generator (event-gen-4)
:checker (checker1)}))
:checker (checker4)}))
(defn stop-start-test
"Stops and starts nodes, checking that join is ok."