diff --git a/docs/src/reference/asciidoc/appendix.adoc b/docs/src/reference/asciidoc/appendix.adoc
index 144a94b2..b26a464b 100644
--- a/docs/src/reference/asciidoc/appendix.adoc
+++ b/docs/src/reference/asciidoc/appendix.adoc
@@ -537,31 +537,79 @@ What's happening in above chart:
 
 ==== Partition Tolerance
 We need to always assume that sooner or later things in a cluster will
-go bad whether that is just a crash of a `Zookeeper` or a state
+go bad, whether it is just a crash of a `Zookeeper` instance, a state
 machine or a network problem like a `brain split`. Brain split is a
 situation where existing cluster members are isolated so that only
 part of a hosts are able to see each others. Usual scenario is that a
-brain split will create a minority and majority of an ensemble where
-hosts in a minority cannot participate in an ensemble anymore until
-network status has been healed.
+brain split will create minority and majority partitions of an
+ensemble where hosts in the minority cannot participate in the ensemble
+anymore until the network has been healed.
 
-In this test we will demostrate that a various types of brain-split's in
-an ensemble will eventually cause n fully synchronized state of all
+In the tests below we demonstrate that various types of brain splits in
+an ensemble will eventually lead to a fully synchronized state of all
 distributed state machines.
 
-image::images/sm-tech-partition-half.png[width=500]
+There are two scenarios for a single clean brain split in a
+network where `Zookeeper` and `Statemachine` instances are
+split in half, assuming each `Statemachine` connects to a
+local `Zookeeper` instance:
 
+* If the current zookeeper leader is kept in the majority, all clients
+  connected to the majority will keep functioning properly.
+* If the current zookeeper leader is left in the minority, all clients will
+  disconnect from it and will try to reconnect until the previous
+  minority members have successfully rejoined the existing majority
+  ensemble.
+
+[NOTE]
+====
+In our current `jepsen` tests we can't control whether a zookeeper split
+brain leaves the leader in the majority or the minority, so we need to
+run the tests multiple times to hit each situation.
+====
+
+[NOTE]
+====
+In the plots below we have mapped a state machine error into an
+`error` state to indicate that the `state machine` is in an error state instead of
+a normal state. Please keep this in mind when interpreting chart states.
+====
+
+In this first test we show that when the existing zookeeper leader is
+kept in the majority, 3 out of 5 machines will continue as is.
+
+image::images/sm-tech-partition-half-1.png[width=500]
 What's happening in above chart:
 
 * First event `C` is sent to all machine leading a state change to
   `S211`.
 * Jepsen nemisis will cause a brain-split which is causing partitions
   of `n1/n2/n5` and `n3/n4`. Nodes `n3/n4` are left in minority and
-  nodes `n1/n2/n5` constructs a new healthy majority. Nodes in
+  nodes `n1/n2/n5` construct a new healthy majority. Nodes in
   majority will keep function without problems but nodes in minority
   will get into error state.
 * Jepsen will heal network and after some time nodes `n3/n4` will join
   back into ensemble and synchronize its distributed status.
+* Lastly event `K1` is sent to all state machines to ensure that the ensemble
+  is working properly. This state change will lead back to state
+  `S21`.
+
+In this second test we show that when the existing zookeeper leader is
+left in the minority, all machines will error out:
+
+image::images/sm-tech-partition-half-2.png[width=500]
+What's happening in above chart:
+
+* First event `C` is sent to all machines, leading to a state change to
+  `S211`.
+* Jepsen nemesis will cause a brain split, partitioning the network
+  so that the existing `Zookeeper` leader is left in the minority and all
+  instances are disconnected from the ensemble.
+* Jepsen will heal the network and after some time all nodes will join
+  back into the ensemble and synchronize their distributed status.
+* Lastly event `K1` is sent to all state machines to ensure that ensemble
+  is working properly. This state change will lead back to state
+  `S21`.
 
 ==== Crash and Join Tolerance
 In this test we will demostrate that killing existing state machine
diff --git a/docs/src/reference/asciidoc/images/sm-tech-partition-half-1.png b/docs/src/reference/asciidoc/images/sm-tech-partition-half-1.png
new file mode 100644
index 00000000..73e1c7ca
Binary files /dev/null and b/docs/src/reference/asciidoc/images/sm-tech-partition-half-1.png differ
diff --git a/docs/src/reference/asciidoc/images/sm-tech-partition-half-2.png b/docs/src/reference/asciidoc/images/sm-tech-partition-half-2.png
new file mode 100644
index 00000000..a046ffef
Binary files /dev/null and b/docs/src/reference/asciidoc/images/sm-tech-partition-half-2.png differ
diff --git a/docs/src/reference/asciidoc/images/sm-tech-partition-half.png b/docs/src/reference/asciidoc/images/sm-tech-partition-half.png
deleted file mode 100644
index a8ccb099..00000000
Binary files a/docs/src/reference/asciidoc/images/sm-tech-partition-half.png and /dev/null differ
diff --git a/jepsen/spring-statemachine-jepsen/src/spring_statemachine_jepsen/checker.clj b/jepsen/spring-statemachine-jepsen/src/spring_statemachine_jepsen/checker.clj
index d76965c8..3e61cc0e 100644
--- a/jepsen/spring-statemachine-jepsen/src/spring_statemachine_jepsen/checker.clj
+++ b/jepsen/spring-statemachine-jepsen/src/spring_statemachine_jepsen/checker.clj
@@ -13,7 +13,7 @@
             [gnuplot.core :as g]))
 
 (def nodetovalue {"n1" 1 "n2" 2 "n3" 3 "n4" 4 "n5" 5 })
-(def statetovalue {"S11" 1 "S12" 2 "S211" 3 "S212" 4 })
+(def statetovalue {"S11" 1 "S12" 2 "S211" 3 "S212" 4 "error" 5 })
 (def variabletovalue {"v1" 1 "v2" 2 "v3" 3 "v4" 4 "v5" 5 "v6" 6 "v7" 7 "v8" 8})
 
 (defn shiftvalue
@@ -76,6 +76,20 @@
     (vector
       (reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (get nodetovalue (name (last (keys (get value :value))))) "X")) ) [] (vec data)))))
 
+(defn extract-plot-data6
+  [history]
+  (let [data
+        (->> history
+             (filter #(= :ok (:type %)))
+             (filter #(= :statesnoexpect (:f %)))
+             (group-by :process))]
+    (vector
+      (reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 0))
+      (reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 1))
+      (reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 2))
+      (reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 3))
+      (reduce-kv (fn [vec key value] (conj vec (vector (get value :time) (shiftvalue (get statetovalue (last (get value :value))) (get value :process) ))) ) [] (get data 4)))))
+
 (defn plot1!
   [test model history]
 
@@ -172,6 +186,38 @@
       output-path)
   {:valid? true})
 
+(defn plot4!
+  [test model history]
+
+  (let [output-path (.getCanonicalPath (store/path! test "states.png"))]
+    (g/raw-plot! [[:set :key :outside]
+                  [:set :style :textbox :opaque]
+                  [:set :terminal :qt :size (keyword "900,450")]
+                  [:set :yrange (keyword "[0.5:5.5]")]
+                  [:set :y2range (keyword "[0.5:5.5]")]
+                  [:set :xtics :format "%h\nns"]
+                  [:set :xlabel "elapsed time"]
+                  [:set :ylabel "states in nodes"]
+                  [:set :y2label "events via nodes"]
+                  [:set :ytics 1]
+                  [:set :ytics (keyword "('S21' 1, 'S22' 2, 'S211' 3, 'S212' 4, 'error' 5)")]
+                  [:set :ytics :nomirror]
+                  [:set :y2tics 1]
+                  [:set :y2tics (keyword "('n1' 1, 'n2' 2, 'n3' 3, 'n4' 4, 'n5' 5)")]
+                  [:plot
+                   (g/list ["-" :title "states n1" :with :steps :lw :3]
+                           ["-" :title "states n2" :with :steps :lw :3]
+                           ["-" :title "states n3" :with :steps :lw :3]
+                           ["-" :title "states n4" :with :steps :lw :3]
+                           ["-" :title "states n5" :with :steps :lw :3]
+                           ["-" :title "events" :with :labels :center :boxed :font ",15" :axis :x1y2]
+                           )]]
+                 (into
+                   (extract-plot-data6 history)
+                   (extract-plot-data2 history)))
+      output-path)
+  {:valid? true})
+
 (defn checker1
   "Constructs a Jepsen checker."
   []
@@ -192,3 +238,10 @@
   (reify Checker
     (check [_ test model history]
       (if (env :plot) (plot3! test model history) {:valid? true}))))
+
+(defn checker4
+  "Constructs a Jepsen checker."
+  []
+  (reify Checker
+    (check [_ test model history]
+      (if (env :plot) (plot4! test model history) {:valid? true}))))
diff --git a/jepsen/spring-statemachine-jepsen/src/spring_statemachine_jepsen/core.clj b/jepsen/spring-statemachine-jepsen/src/spring_statemachine_jepsen/core.clj
index 22a651f5..d971c8e9 100644
--- a/jepsen/spring-statemachine-jepsen/src/spring_statemachine_jepsen/core.clj
+++ b/jepsen/spring-statemachine-jepsen/src/spring_statemachine_jepsen/core.clj
@@ -18,6 +18,7 @@
             [spring-statemachine-jepsen.checker :refer [checker1]]
             [spring-statemachine-jepsen.checker :refer [checker2]]
             [spring-statemachine-jepsen.checker :refer [checker3]]
+            [spring-statemachine-jepsen.checker :refer [checker4]]
             [jepsen.checker.timeline :as timeline]
             [jepsen.control.net :as net]
             [jepsen.os.debian :as debian]
@@ -41,18 +42,20 @@
   (http/post (str "http://" (name node) ":8080/event")
              {:form-params {:id (str event) :testVariable value}}))
 
-(defn sm-read-states
-  "Reading states from a state machine"
-  [node]
-  (let [response (http/get (str "http://" (name node) ":8080/states") {:as :json})]
-    (get response :body)))
-
 (defn sm-read-status-ok?
   "Read status and check that there is no errors"
   [node]
   (let [response (http/get (str "http://" (name node) ":8080/status") {:as :json})]
     (= (get (get response :body) :hasStateMachineError) false)))
 
+(defn sm-read-states
+  "Reading states from a state machine"
+  [node]
+  (if (sm-read-status-ok? node)
+    (let [response (http/get (str "http://" (name node) ":8080/states") {:as :json})]
+      (get response :body))
+    (vec ["error"])))
+
 (defn sm-read-state-variable
   "Read status and check that there is no errors"
   [node key]
@@ -82,8 +85,7 @@
             (Thread/sleep 1000)
             (if (sm-read-status-ok? node) false true)
             (catch Exception e true))
-      (recur))))
-  )
+      (recur)))))
 
 (defn start!
   [node]
@@ -155,6 +157,11 @@
                  )
                (catch RuntimeException e
                  (assoc op :type :fail :value (.getMessage e))))
+      :statesnoexpect (try
+                        (let [res (sm-read-states client)]
+                          (assoc op :type :ok :value (vec res)))
+                        (catch RuntimeException e
+                          (assoc op :type :fail :value (.getMessage e))))
       :event (try
                (sm-send-event client (:e op))
                (assoc op :type :ok :value (:e op))
@@ -187,6 +194,17 @@
                           :f :states
                           :s expect}]))))))
 
+(defn gen-read-states-noexpect
+  "Read states n times."
+  [times]
+  (gen/clients
+    (gen/each
+      (gen/seq
+        (take (* times 2)
+              (cycle [(gen/sleep 1)
+                      {:type :invoke
+                       :f :statesnoexpect}]))))))
+
 (defn gen-send-event
   "Send event one time to random node."
   [event]
@@ -307,24 +325,19 @@
   "Generates event and checks states while splitting network"
   []
   (gen/phases
-    (gen-read-states 5 ["S0","S1","S11"])
+    (gen-read-states-noexpect 10)
     (gen-send-event-all "C")
-    (gen-read-states 5 ["S0","S2","S21","S211"])
-    (gen-status 2)
+    (gen-read-states-noexpect 10)
     ;start nemesis, split network
     (gen/nemesis
       (gen/once {:type :info :f :start}))
-    (gen-read-states 15 ["S0","S2","S21","S211"])
-    (gen-status 5)
+    (gen-read-states-noexpect 15)
     ;stop nemesis, heal network
     (gen/nemesis
       (gen/once {:type :info :f :stop}))
-    (gen-status 5)
-    (gen-read-states 15 ["S0","S2","S21","S211"])
+    (gen-read-states-noexpect 100)
     (gen-send-event-all "K")
-    (gen-read-states 10 ["S0","S1","S11"])
-    (gen-status 30)
-    (gen-read-states 10 ["S0","S1","S11"])))
+    (gen-read-states-noexpect 10)))
 
 (defn event-gen-5
   "Generates starts and stops and checks joins"
@@ -431,7 +444,7 @@
   (event-test "partition-half"
               {:nemesis (nemesis/partition-random-halves)
                :generator (event-gen-4)
-               :checker (checker1)}))
+               :checker (checker4)}))
 
 (defn stop-start-test
   "Stops and start nodes checking join is okk."
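
Review note: `extract-plot-data6` repeats the same `reduce-kv` form once per process. The sketch below shows one possible way to build the same per-process series with a single mapping step. It is only an illustration, not part of this change. The names `state->value`, `state-series` and `sample-history` are hypothetical, and the shift function is passed in as an argument because the body of `shiftvalue` is not visible in this diff.

```clojure
(def state->value
  "Same mapping as statetovalue in checker.clj after this change."
  {"S11" 1 "S12" 2 "S211" 3 "S212" 4 "error" 5})

(defn state-series
  "Returns one vector of [time y] points per process 0..(process-count - 1).
  Only successful :statesnoexpect reads are kept. `shift` is a function of
  [y process]; in checker.clj this role is played by shiftvalue."
  [history process-count shift]
  (let [by-process (->> history
                        (filter #(= :ok (:type %)))
                        (filter #(= :statesnoexpect (:f %)))
                        (group-by :process))]
    (mapv (fn [process]
            (mapv (fn [{t :time states :value}]
                    ;; The last element of the state list is the plotted state.
                    [t (shift (get state->value (last states)) process)])
                  (get by-process process [])))
          (range process-count))))

;; Hypothetical sample history, not taken from a real Jepsen run.
(def sample-history
  [{:type :ok   :f :statesnoexpect :process 0 :time 10 :value ["S0" "S2" "S21" "S211"]}
   {:type :ok   :f :statesnoexpect :process 1 :time 12 :value ["error"]}
   {:type :fail :f :statesnoexpect :process 1 :time 13 :value "boom"}
   {:type :ok   :f :event          :process 0 :time 14 :value "C"}])

;; An identity shift keeps the example independent of shiftvalue.
(println (state-series sample-history 2 (fn [y _] y)))
;; prints [[[10 3]] [[12 5]]]
```

Under these assumptions, `(state-series history 5 shiftvalue)` would yield the same five series that `extract-plot-data6` builds by hand, and the process count is no longer hard-coded into the function body.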