Getting started with Twint
A friendly guide to your first Twitter web scrape in a Jupyter Notebook (Mac).
Whether this is your first time using twint or you are running into issues with it in your Jupyter Notebook, this might be the solution you are looking for.
First, we want to start from the very beginning. To install twint, run the install command and then upgrade twint to the current version (the leading ! lets you run these from a notebook cell; drop it if you use the terminal). Both commands come from twint’s GitHub:
!pip3 install twint
!pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint
We’re off to a great start! But now, if you try to run a search, you will come across the following error:
RuntimeError: This event loop is already running
This is definitely not good. An answer I found on Stack Overflow by Mikhail Gerasimov explains why:
“Event loop running — is an entry point of your async program. It manages running of all coroutines, tasks, callbacks. Running loop while it’s running makes no sense: in some sort it’s like trying to run job executor from same already running job executor.”
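To see what that means concretely, here is a small stdlib-only sketch (no twint involved) that reproduces the same error: a coroutine, standing in for a notebook cell, tries to call run_until_complete on the very loop that is already running it.

```python
import asyncio

async def scrape():
    return "tweets"

async def notebook_cell():
    # Jupyter runs each cell inside an already-running event loop,
    # so trying to start that loop again from within it fails.
    loop = asyncio.get_running_loop()
    coro = scrape()
    try:
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning
        return str(exc)

msg = asyncio.run(notebook_cell())
print(msg)  # This event loop is already running
```

This is exactly the situation twint ends up in when it tries to run its own event loop inside Jupyter.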
To fix this issue we need to install the following library:
!pip install nest-asyncio
And now we’re ready to start web scraping! To begin, we import our libraries in the Jupyter Notebook.
Then we can set up our first web scrape in the notebook. First, apply nest_asyncio in the same cell where you configure your twint search parameters, then set up the search itself. In this case, we are just going to do a simple hashtag search.
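Put together, that setup cell might look like this; the hashtag and limit are example values:

```python
import twint
import nest_asyncio

# Patch the already-running Jupyter event loop so twint can reuse it
nest_asyncio.apply()

c = twint.Config()
c.Search = "#woman"  # example hashtag search
c.Limit = 1          # requested number of tweets (twint may return more)
```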
Unfortunately, most of the parameters you specify in twint are not exact in the results they return. In the end, it’s far easier to trim the data once you have it than to fight with twint’s parameters. In this case, even though I specified that I only want 1 tweet, the search returned 71 rows of tweets.
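For example, once the results are in a data frame, trimming down to the one tweet we actually wanted is a single line. The data frame below is a stand-in for real scraped data:

```python
import pandas as pd

# Stand-in for the 71 rows twint actually returned
df = pd.DataFrame({"tweet": ["first tweet", "second tweet", "third tweet"]})

# Keep only the one tweet we asked for
first = df.head(1)
print(len(first))  # 1
```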
Finally, if we want to store whatever we scraped in a data frame (in this case, I am using Pandas as my data frame output), we add a couple of parameters:
c.Hide_output = True
c.Pandas = True
Here, “c.Hide_output” tells twint not to print the results when you run your search, and “c.Pandas” tells twint to store whatever was scraped in a pandas data frame.
Now we can finally run our search! We set up our parameters, and we run the search with twint.run.Search(c).
As I mentioned before, there is no output after the search runs. But that doesn’t mean the data is gone. To see what twint has gathered, I just need to call it from twint’s storage using the following line:
twint.storage.panda.Tweets_df
“Tweets_df” is the name under which twint stores the data frame, so do not change it. You can store the output under whatever name you desire by assigning the line above to a variable, and voilà! You have done your first Twitter web scrape!
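That last step is a one-liner (this assumes the search above has already run and filled twint’s storage):

```python
import twint

# Store the scraped tweets under a name of your choosing
my_tweets = twint.storage.panda.Tweets_df
my_tweets.head()
```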
If you would like to see the full list of parameters you can use, check out twint’s GitHub.
Below you will find the complete code:
import pandas as pd
import twint
import nest_asyncio
nest_asyncio.apply()
c = twint.Config()
c.Search = "#woman"
c.Limit = 1
c.Pandas = True
c.Hide_output = True
twint.run.Search(c)
twint.storage.panda.Tweets_df