Why is multithreading Selenium lousy on macOS?
This blog post might be the start of a series, depending on how much bandwidth I have to investigate this further...
I've been working on a new data problem that requires using Selenium to extract information quickly. To speed things up further, because I'm impatient as hell, I decided to use ThreadPoolExecutor from concurrent.futures in my Python script to spin up a bunch of Chrome instances like this:
import traceback
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def setup_driver():
    # Each worker thread gets its own headless Chrome instance.
    chrome_options = Options()
    chrome_options.add_argument("--headless=new")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--window-size=1920,1080")
    return webdriver.Chrome(options=chrome_options)


def search_range(range_tuple):
    start_num, end_num, thread_num = range_tuple
    driver = setup_driver()
    # ...rest of selenium searching / parsing / processing logic


def main():
    start_entry = 0
    end_entry = 5000
    max_threads = 10
    chunk_size = (end_entry - start_entry) // max_threads
    ranges = []
    for i in range(max_threads):
        range_start = start_entry + (i * chunk_size)
        # The last chunk runs through end_entry so no entries are dropped.
        range_end = range_start + chunk_size - 1 if i < max_threads - 1 else end_entry
        ranges.append((range_start, range_end, i))
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        futures = [executor.submit(search_range, range_) for range_ in ranges]
        for future in futures:
            try:
                future.result()
            except Exception as e:
                print(f"Thread crashed with error: {e}")
                traceback.print_exc()


if __name__ == "__main__":
    main()
The chrome_options specified are mainly there to optimize performance, since I am running Chrome headless.
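The chunking arithmetic in main() is easy to get subtly wrong at the last chunk, so it may be worth factoring into a helper that can be checked on its own. The sketch below is one way to do that. The name `split_ranges` is hypothetical, and it assumes both ends of each range are inclusive, which is what the `- 1` in main() suggests.

```python
def split_ranges(start, end, parts):
    """Split the inclusive range [start, end] into contiguous chunks.

    Returns (chunk_start, chunk_end, index) tuples, the same shape that
    search_range() unpacks. Any remainder is spread over the first chunks,
    so chunk sizes differ by at most one.
    """
    total = end - start + 1
    if parts <= 0 or total <= 0:
        raise ValueError("need parts > 0 and end >= start")
    parts = min(parts, total)  # never create empty chunks
    base, extra = divmod(total, parts)

    ranges = []
    chunk_start = start
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunk_end = chunk_start + size - 1
        ranges.append((chunk_start, chunk_end, i))
        chunk_start = chunk_end + 1
    return ranges
```

With this in place, the loop in main() could become `ranges = split_ranges(start_entry, end_entry, max_threads)`, and the boundaries can be checked without launching a single browser.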
I have two machines with similar-ish specs (though now I'm doubting this), both bought around the same time in 2022:
- Lenovo ThinkPad X1 Carbon 10th Gen - 32 GB RAM (running Ubuntu)
- MacBook Pro - Apple M1 Pro - 32 GB RAM
The M1 performance with the above code was terrible (I think it's the first time I've really heard my fans spin up). Inspecting the performance in htop was practically bewildering, especially when I then looked at the ThinkPad running the exact same script.
[Screenshots: htop on macOS, at startup and while running the script]
[Screenshots: htop on Linux, at startup and while running the script]
Interesting Observations
- From the start, the number of tasks on Linux is about one fifth of that on macOS.
- On macOS the CPU usage on all my cores shot up to 100% almost immediately after the script started running.
- Linux seems to never show a count for running processes (though the script is obviously running, and I could see many Chrome processes listed in htop). On macOS this consistently showed up at 10 while I was running the script.
- The Load average was also substantially higher on macOS vs Linux.
- The memory usage on macOS was also more than 2x that of Linux.
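Since the two machines may not have the same number of cores, raw load averages are hard to compare directly. A rough way to put them on the same footing is to divide by the core count, as in the sketch below. The function names are hypothetical, `os.getloadavg()` is only available on Unix-like systems, and the two kernels do not compute load average identically, so the result is a ballpark figure at best.

```python
import os


def normalized_load(loadavg, cpu_count):
    """Divide each load average by the core count.

    A value near 1.0 means roughly one runnable task per core.
    """
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return tuple(value / cpu_count for value in loadavg)


def load_snapshot():
    """Return (raw, per_core) 1-, 5- and 15-minute load averages."""
    raw = os.getloadavg()        # Unix only (macOS and Linux)
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return raw, normalized_load(raw, cores)
```

Calling `load_snapshot()` on each machine while the script is running gives numbers that can be logged and compared later, rather than read off htop by eye.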
Next Steps?
I don't have time to dig into this right now, but if I manage to revisit it, I think the first step would be to try replicating the results in containers. It looks like there's actually a macOS VM via Docker-OSX, so that might be a good place to start. A bit of googling also revealed this issue, but seeing as it was resolved over 2 years ago, I doubt this is still the problem.
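A cheaper experiment that could precede the container work is to sweep the number of worker threads and time each run on both machines. The sketch below is hypothetical: `time_with_workers`, `sweep_worker_counts`, and `fake_search` are illustrative names, and `fake_search` merely sleeps in place of real Selenium work, so the timings it produces say nothing about Chrome itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_with_workers(task, items, max_workers):
    """Run task(item) for every item on a thread pool; return elapsed seconds."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() drains the results and re-raises any worker exception.
        list(executor.map(task, items))
    return time.perf_counter() - started


def sweep_worker_counts(task, items, worker_counts):
    """Return {worker_count: elapsed_seconds} for each pool size tried."""
    return {n: time_with_workers(task, items, n) for n in worker_counts}


def fake_search(item):
    """Hypothetical stand-in for search_range(): sleeps instead of driving Chrome."""
    time.sleep(0.01)
    return item
```

Swapping `fake_search` for the real `search_range` and calling `sweep_worker_counts(search_range, ranges, [1, 2, 4, 8, 10])` would show whether the Mac stops benefiting from extra threads earlier than the ThinkPad does.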
For now I'd say, proceed with caution if you're going to try multithreading with Selenium on a Mac M1 (or use the opportunity to warm your lap in the dead of winter).