The following example deliberately mixes local variables, global variables, and function parameters to show how the global statement works:
def foo(x, y):
    global a
    a = 42
    x, y = y, x
    b = 33
    b = 17
    c = 100
    print(a, b, x, y)

a, b, x, y = 1, 15, 3, 4
foo(17, 4)
print(a, b, x, y)
The output of the script above looks like this:
42 17 4 17
42 15 3 4

The first line is printed inside foo(): a was set to 42 via the global statement, b and c are local, and the parameters x and y were swapped. The second line shows that after the call only the global a kept its new value 42, while b, x, and y are unchanged.
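For contrast (this snippet is not part of the original example), assigning to a name inside a function without declaring it global makes that name local to the function, so reading it before the assignment fails:

a = 10

def bump():
    a = a + 1   # UnboundLocalError: 'a' is local here because it is assigned to

bump()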
What is the output of the following code?
def increment_counter():
    global counter
    counter += 1
    print(counter)

def get_number_of_elements(rdd):
    global counter
    counter = 0
    rdd.foreach(lambda x: increment_counter())
    return counter
It is 0.
My guess is that although increment_counter() is called once per element of rdd, it never updates the counter defined in get_number_of_elements(rdd). When rdd.foreach() runs, PySpark serializes the closure, including the global counter, and ships a copy to each executor, so increment_counter() increments the executor's local copy rather than the variable in the driver. Thus the counter returned from get_number_of_elements(rdd) is still 0, no matter how many elements rdd has.
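If you actually need to count from inside foreach(), Spark provides accumulators for exactly this pattern: an accumulator lives on the driver, and tasks on the executors may only add to it. The sketch below (names like acc are illustrative) counts the elements of a small RDD; rdd.count() is the simpler idiom when all you need is the size:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

# An accumulator is safe to update from executor tasks.
acc = sc.accumulator(0)
rdd.foreach(lambda x: acc.add(1))

print(acc.value)    # 5
print(rdd.count())  # 5, the idiomatic way to get the size

spark.stop()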
You can verify this yourself with the following script.
from __future__ import print_function

from pyspark.sql import SparkSession

def increment_counter():
    # Runs on the executors; it increments the executor's copy of counter.
    global counter
    counter += 1
    print(counter)

def get_number_of_elements(rdd):
    global counter
    counter = 0
    rdd.foreach(lambda x: increment_counter())
    return counter

if __name__ == "__main__":
    # Initialize the Spark session.
    spark = SparkSession\
        .builder\
        .appName("test")\
        .getOrCreate()

    lines = spark.read.text('url.txt').rdd.map(lambda r: r[0])
    # print(lines.collect())
    print(get_number_of_elements(lines))

    spark.stop()
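For contrast, the same global-counter pattern works as expected in plain single-process Python, which confirms that the surprise comes from Spark serializing the closure and running it on executors, not from the global statement itself (a small illustrative sketch):

counter = 0

def increment_counter():
    global counter
    counter += 1

# A plain loop runs in one process, so the global really is updated.
for line in ['a', 'b', 'c']:
    increment_counter()

print(counter)  # 3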