Tweak jepsen partition tests to have better output
@@ -537,31 +537,79 @@ What's happening in above chart:
|
||||
|
||||
==== Partition Tolerance
|
||||
We need to always assume that sooner or later things in a cluster will
|
||||
go bad whether that is just a crash of a `Zookeeper` or a state
|
||||
go bad whether it is just a crash of a `Zookeeper` instance, a state
|
||||
machine or a network problem like a `brain split`. Brain split is a
|
||||
situation where existing cluster members are isolated so that only
|
||||
part of the hosts are able to see each other. The usual scenario is that a
|
||||
brain split will create a minority and majority of an ensemble where
|
||||
hosts in a minority cannot participate in an ensemble anymore until
|
||||
network status has been healed.
|
||||
brain split will create minority and majority partitions of an
|
||||
ensemble where hosts in a minority cannot participate in an ensemble
|
||||
anymore until network status has been healed.
|
||||
|
||||
In this test we will demostrate that a various types of brain-split's in
|
||||
an ensemble will eventually cause n fully synchronized state of all
|
||||
In the tests below we will demonstrate that various types of brain splits in
|
||||
an ensemble will eventually result in a fully synchronized state of all
|
||||
distributed state machines.
|
||||
|
||||
image::images/sm-tech-partition-half.png[width=500]
|
||||
There are two scenarios with one straight brain split in a
|
||||
network where `Zookeeper` and `Statemachine` instances are
|
||||
split in half, assuming each `Statemachine` will connect into a
|
||||
local `Zookeeper` instance:
|
||||
|
||||
* If current zookeeper leader is kept in a majority, all clients
|
||||
connected into majority will keep functioning properly.
|
||||
* If current zookeeper leader is left in minority, all clients will
|
||||
disconnect from it and will try to reconnect until the previous
|
||||
minority members have successfully joined back to the existing majority
|
||||
ensemble.
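
The majority/minority distinction above can be sketched in a few lines of Clojure. This is only an illustration, assuming the default ZooKeeper rule that a partition needs a strict majority of the ensemble to keep serving; the function names `majority-size`, `has-quorum?` and `classify-split` are hypothetical and are not part of the jepsen test code.

```clojure
;; Illustrative sketch only; these helpers are hypothetical and do not
;; exist in the spring-statemachine-jepsen sources.

(defn majority-size
  "Smallest number of members that forms a strict majority of an ensemble."
  [ensemble-size]
  (inc (quot ensemble-size 2)))

(defn has-quorum?
  "True when a partition holds a strict majority of the ensemble."
  [ensemble-size partition]
  (>= (count partition) (majority-size ensemble-size)))

(defn classify-split
  "Labels each side of a split as :majority or :minority."
  [ensemble-size sides]
  (into {}
        (map (fn [side]
               [side (if (has-quorum? ensemble-size side)
                       :majority
                       :minority)])
             sides)))

;; Example: the n1/n2/n5 versus n3/n4 split of a five node ensemble.
(classify-split 5 [#{"n1" "n2" "n5"} #{"n3" "n4"}])
```

With five nodes a side needs at least three members, so `n1/n2/n5` is labelled `:majority` and `n3/n4` is labelled `:minority`. Which side still contains the old leader is a separate question that this sketch does not model.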
|
||||
|
||||
[NOTE]
|
||||
====
|
||||
In our current `jepsen` tests we can't separate zookeeper split-brain
|
||||
scenarios between leader left in majority or minority, so we need to
|
||||
run the tests multiple times to cover both situations.
|
||||
====
|
||||
|
||||
[NOTE]
|
||||
====
|
||||
In the plots below we have mapped a state machine error state into an
|
||||
`error` to indicate that the `state machine` is in an error state instead of
|
||||
a normal state. Please keep this in mind when interpreting chart states.
|
||||
====
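
As a rough illustration of the note above, the following sketch shows how one read result could be turned into a plot point where `error` gets its own row on the y axis. The mapping mirrors the `statetovalue` table used by the checker. The names `state->y` and `plot-point` are hypothetical, and the per-process offset that the real checker applies through `shiftvalue` is left out on purpose.

```clojure
;; Illustrative sketch only; `state->y` and `plot-point` are hypothetical
;; names. The real checker additionally shifts each value per process.

(def state->y
  "Maps a leaf state name, or the synthetic \"error\" marker, to a y value."
  {"S11" 1 "S12" 2 "S211" 3 "S212" 4 "error" 5})

(defn plot-point
  "Builds a [time y] pair from one completed read operation.
  The last element of :value is taken as the leaf state."
  [{:keys [time value]}]
  [time (get state->y (last value))])

;; A healthy read and a read taken while the machine is in error state.
(plot-point {:time 100 :value ["S0" "S2" "S21" "S211"]})
(plot-point {:time 200 :value ["error"]})
```

The first call yields `[100 3]` and the second `[200 5]`, so a node in error state shows up on the topmost row of the chart.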
|
||||
|
||||
In this first test we show that when existing zookeeper leader was
|
||||
kept in majority, 3 out of 5 machines will continue as is.
|
||||
|
||||
image::images/sm-tech-partition-half-1.png[width=500]
|
||||
What's happening in above chart:
|
||||
|
||||
* First, event `C` is sent to all machines, leading to a state change to
|
||||
`S211`.
|
||||
* The Jepsen nemesis will cause a brain split, creating partitions
|
||||
of `n1/n2/n5` and `n3/n4`. Nodes `n3/n4` are left in minority and
|
||||
nodes `n1/n2/n5` constructs a new healthy majority. Nodes in
|
||||
nodes `n1/n2/n5` construct a new healthy majority. Nodes in
|
||||
majority will keep functioning without problems, but nodes in minority
|
||||
will get into error state.
|
||||
* Jepsen will heal network and after some time nodes `n3/n4` will join
|
||||
back into ensemble and synchronize their distributed status.
|
||||
* Lastly event `K1` is sent to all state machines to ensure that ensemble
|
||||
is working properly. This state change will lead back to state
|
||||
`S21`.
|
||||
|
||||
In this second test we show that when existing zookeeper leader was
|
||||
left in minority, all machines will error out:
|
||||
|
||||
image::images/sm-tech-partition-half-2.png[width=500]
|
||||
What's happening in above chart:
|
||||
|
||||
* First, event `C` is sent to all machines, leading to a state change to
|
||||
`S211`.
|
||||
* The Jepsen nemesis will cause a brain split, creating partitions
|
||||
so that existing `Zookeeper` leader is kept in minority and all
|
||||
instances are disconnected from ensemble.
|
||||
* Jepsen will heal network and after some time all nodes will join
|
||||
back into ensemble and synchronize their distributed status.
|
||||
* Lastly event `K1` is sent to all state machines to ensure that ensemble
|
||||
is working properly. This state change will lead back to state
|
||||
`S21`.
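
The final step in both walk-throughs, where all nodes have rejoined and agree again, can be expressed as a simple check. This is a minimal sketch under the assumption that each node reports its states as a vector whose last element is the leaf state, with `["error"]` reported while the machine is in error state; the function name `converged?` is hypothetical and not part of the test sources.

```clojure
;; Illustrative sketch only; `converged?` is a hypothetical helper.

(defn converged?
  "True when every node reports the same leaf state and that state is
  not the synthetic \"error\" marker. An empty map is not converged."
  [node->states]
  (let [finals (map last (vals node->states))]
    (boolean
     (and (seq finals)
          (apply = finals)
          (not= "error" (first finals))))))

;; During the split the minority nodes report an error state.
(converged? {"n1" ["S0" "S2" "S21" "S211"]
             "n3" ["error"]})

;; After healing all nodes report the same leaf state again.
(converged? {"n1" ["S0" "S2" "S21" "S211"]
             "n3" ["S0" "S2" "S21" "S211"]})
```

The first call returns `false` and the second returns `true`. A check like this only looks at one snapshot, so in practice it would have to be repeated for some time after the network is healed.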
|
||||
|
||||
==== Crash and Join Tolerance
|
||||
In this test we will demonstrate that killing existing state machine
|
||||
|
||||
BIN
docs/src/reference/asciidoc/images/sm-tech-partition-half-1.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 33 KiB |
BIN
docs/src/reference/asciidoc/images/sm-tech-partition-half-2.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 34 KiB |
Binary file not shown.
|
Before Width: | Height: | Size: 31 KiB |
@@ -13,7 +13,7 @@
|
||||
[gnuplot.core :as g]))
|
||||
|
||||
(def nodetovalue {"n1" 1 "n2" 2 "n3" 3 "n4" 4 "n5" 5 })
|
||||
(def statetovalue {"S11" 1 "S12" 2 "S211" 3 "S212" 4 })
|
||||
(def statetovalue {"S11" 1 "S12" 2 "S211" 3 "S212" 4 "error" 5 })
|
||||
(def variabletovalue {"v1" 1 "v2" 2 "v3" 3 "v4" 4 "v5" 5 "v6" 6 "v7" 7 "v8" 8})
|
||||
|
||||
(defn shiftvalue
|
||||
@@ -76,6 +76,20 @@
|
||||
(vector
|
||||
(reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (get nodetovalue (name (last (keys (get value :value))))) "X")) ) [] (vec data)))))
|
||||
|
||||
(defn extract-plot-data6
|
||||
[history]
|
||||
(let [data
|
||||
(->> history
|
||||
(filter #(= :ok (:type %)))
|
||||
(filter #(= :statesnoexpect (:f %)))
|
||||
(group-by :process))]
|
||||
(vector
|
||||
(reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 0))
|
||||
(reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 1))
|
||||
(reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 2))
|
||||
(reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 3))
|
||||
(reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 4)))))
|
||||
|
||||
(defn plot1!
|
||||
[test model history]
|
||||
|
||||
@@ -172,6 +186,38 @@
|
||||
output-path)
|
||||
{:valid? true})
|
||||
|
||||
(defn plot4!
|
||||
[test model history]
|
||||
|
||||
(let [output-path (.getCanonicalPath (store/path! test "states.png"))]
|
||||
(g/raw-plot! [[:set :key :outside]
|
||||
[:set :style :textbox :opaque]
|
||||
[:set :terminal :qt :size (keyword "900,450")]
|
||||
[:set :yrange (keyword "[0.5:5.5]")]
|
||||
[:set :y2range (keyword "[0.5:5.5]")]
|
||||
[:set :xtics :format "%h\nns"]
|
||||
[:set :xlabel "elapsed time"]
|
||||
[:set :ylabel "states in nodes"]
|
||||
[:set :y2label "events via nodes"]
|
||||
[:set :ytics 1]
|
||||
[:set :ytics (keyword "('S21' 1, 'S22' 2, 'S211' 3, 'S212' 4, 'error' 5)")]
|
||||
[:set :ytics :nomirror]
|
||||
[:set :y2tics 1]
|
||||
[:set :y2tics (keyword "('n1' 1, 'n2' 2, 'n3' 3, 'n4' 4, 'n5' 5)")]
|
||||
[:plot
|
||||
(g/list ["-" :title "states n1" :with :steps :lw :3]
|
||||
["-" :title "states n2" :with :steps :lw :3]
|
||||
["-" :title "states n3" :with :steps :lw :3]
|
||||
["-" :title "states n4" :with :steps :lw :3]
|
||||
["-" :title "states n5" :with :steps :lw :3]
|
||||
["-" :title "events" :with :labels :center :boxed :font ",15" :axis :x1y2]
|
||||
)]]
|
||||
(into
|
||||
(extract-plot-data6 history)
|
||||
(extract-plot-data2 history)))
|
||||
output-path)
|
||||
{:valid? true})
|
||||
|
||||
(defn checker1
|
||||
"Constructs a Jepsen checker."
|
||||
[]
|
||||
@@ -192,3 +238,10 @@
|
||||
(reify Checker
|
||||
(check [_ test model history]
|
||||
(if (env :plot) (plot3! test model history) {:valid? true}))))
|
||||
|
||||
(defn checker4
|
||||
"Constructs a Jepsen checker."
|
||||
[]
|
||||
(reify Checker
|
||||
(check [_ test model history]
|
||||
(if (env :plot) (plot4! test model history) {:valid? true}))))
|
||||
|
||||
@@ -18,6 +18,7 @@
|
||||
[spring-statemachine-jepsen.checker :refer [checker1]]
|
||||
[spring-statemachine-jepsen.checker :refer [checker2]]
|
||||
[spring-statemachine-jepsen.checker :refer [checker3]]
|
||||
[spring-statemachine-jepsen.checker :refer [checker4]]
|
||||
[jepsen.checker.timeline :as timeline]
|
||||
[jepsen.control.net :as net]
|
||||
[jepsen.os.debian :as debian]
|
||||
@@ -41,18 +42,20 @@
|
||||
(http/post (str "http://" (name node) ":8080/event")
|
||||
{:form-params {:id (str event) :testVariable value}}))
|
||||
|
||||
(defn sm-read-states
|
||||
"Reading states from a state machine"
|
||||
[node]
|
||||
(let [response (http/get (str "http://" (name node) ":8080/states") {:as :json})]
|
||||
(get response :body)))
|
||||
|
||||
(defn sm-read-status-ok?
|
||||
"Read status and check that there are no errors"
|
||||
[node]
|
||||
(let [response (http/get (str "http://" (name node) ":8080/status") {:as :json})]
|
||||
(= (get (get response :body) :hasStateMachineError) false)))
|
||||
|
||||
(defn sm-read-states
|
||||
"Reading states from a state machine"
|
||||
[node]
|
||||
(if (sm-read-status-ok? node)
|
||||
(let [response (http/get (str "http://" (name node) ":8080/states") {:as :json})]
|
||||
(get response :body))
|
||||
(vec ["error"])))
|
||||
|
||||
(defn sm-read-state-variable
|
||||
"Read status and check that there is no errors"
|
||||
[node key]
|
||||
@@ -82,8 +85,7 @@
|
||||
(Thread/sleep 1000)
|
||||
(if (sm-read-status-ok? node) false true)
|
||||
(catch Exception e true))
|
||||
(recur))))
|
||||
)
|
||||
(recur)))))
|
||||
|
||||
(defn start!
|
||||
[node]
|
||||
@@ -155,6 +157,11 @@
|
||||
)
|
||||
(catch RuntimeException e
|
||||
(assoc op :type :fail :value (.getMessage e))))
|
||||
:statesnoexpect (try
|
||||
(let [res (sm-read-states client)]
|
||||
(assoc op :type :ok :value (vec res)))
|
||||
(catch RuntimeException e
|
||||
(assoc op :type :fail :value (.getMessage e))))
|
||||
:event (try
|
||||
(sm-send-event client (:e op))
|
||||
(assoc op :type :ok :value (:e op))
|
||||
@@ -187,6 +194,17 @@
|
||||
:f :states
|
||||
:s expect}]))))))
|
||||
|
||||
(defn gen-read-states-noexpect
|
||||
"Read states n times."
|
||||
[times]
|
||||
(gen/clients
|
||||
(gen/each
|
||||
(gen/seq
|
||||
(take (* times 2)
|
||||
(cycle [(gen/sleep 1)
|
||||
{:type :invoke
|
||||
:f :statesnoexpect}]))))))
|
||||
|
||||
(defn gen-send-event
|
||||
"Send event one time to random node."
|
||||
[event]
|
||||
@@ -307,24 +325,19 @@
|
||||
"Generates event and checks states while splitting network"
|
||||
[]
|
||||
(gen/phases
|
||||
(gen-read-states 5 ["S0","S1","S11"])
|
||||
(gen-read-states-noexpect 10)
|
||||
(gen-send-event-all "C")
|
||||
(gen-read-states 5 ["S0","S2","S21","S211"])
|
||||
(gen-status 2)
|
||||
(gen-read-states-noexpect 10)
|
||||
;start nemesis, split network
|
||||
(gen/nemesis
|
||||
(gen/once {:type :info :f :start}))
|
||||
(gen-read-states 15 ["S0","S2","S21","S211"])
|
||||
(gen-status 5)
|
||||
(gen-read-states-noexpect 15)
|
||||
;stop nemesis, heal network
|
||||
(gen/nemesis
|
||||
(gen/once {:type :info :f :stop}))
|
||||
(gen-status 5)
|
||||
(gen-read-states 15 ["S0","S2","S21","S211"])
|
||||
(gen-read-states-noexpect 100)
|
||||
(gen-send-event-all "K")
|
||||
(gen-read-states 10 ["S0","S1","S11"])
|
||||
(gen-status 30)
|
||||
(gen-read-states 10 ["S0","S1","S11"])))
|
||||
(gen-read-states-noexpect 10)))
|
||||
|
||||
(defn event-gen-5
|
||||
"Generates starts and stops and checks joins"
|
||||
@@ -431,7 +444,7 @@
|
||||
(event-test "partition-half"
|
||||
{:nemesis (nemesis/partition-random-halves)
|
||||
:generator (event-gen-4)
|
||||
:checker (checker1)}))
|
||||
:checker (checker4)}))
|
||||
|
||||
(defn stop-start-test
|
||||
"Stops and starts nodes, checking that join is ok."
|
||||
|
||||