
How to Scrape Everything from Telegram Using Python

Learn how to get API keys for Telegram, extract group members, and scrape Telegram group or channel comments

Telegram is one of the most widely used messaging apps in the world. People often use it to manage their communities and run promotions.

In this tutorial, you will learn:

  • How to get API keys for Telegram
  • How to extract group members
  • How to scrape Telegram group or channel comments

How to get API keys for Telegram

Let’s get your Telegram api_id and api_hash values. Go to my.telegram.org, where you will see this area:

Log in here to manage your apps using Telegram API or delete your account. Enter your number and we will send you a confirmation code via Telegram (not SMS).

After logging in, you will see this screen:

Click on API development tools and fill in the required fields. You can choose any name for your app; for example, I chose demo1. After submitting, you will receive the app's api_id and api_hash, as shown in the picture.

Save your api_id and api_hash values somewhere safe. You will use them to log in to the Telegram API.
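
One way to keep these values out of your scripts is a small config.ini file, which the message scraper later in this tutorial hints at in its comments. The sketch below is only an illustration: the file contents are made-up placeholders, and the load_telegram_config helper is a hypothetical name, not part of Telethon.

```python
import configparser

# Hypothetical contents of a config.ini file; the values are placeholders,
# not real credentials.
SAMPLE_CONFIG = """
[Telegram]
api_id = 1234567
api_hash = 0123456789abcdef0123456789abcdef
phone = +10000000000
username = your_telegram_user_name
"""


def load_telegram_config(text):
    """Parse the [Telegram] section and return the credentials as a dict."""
    config = configparser.ConfigParser()
    config.read_string(text)
    section = config["Telegram"]
    return {
        "api_id": section.getint("api_id"),  # converted to int for the client
        "api_hash": section["api_hash"],
        "phone": section["phone"],
        "username": section["username"],
    }


settings = load_telegram_config(SAMPLE_CONFIG)
print(settings["api_id"], settings["username"])
```

In a real project you would call config.read("config.ini") on a file kept out of version control instead of parsing an inline string.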

How to extract Group members

Telethon is an asyncio Python 3 MTProto library to interact with Telegram’s API as a user or through a bot account. This library is meant to make it easy for you to write Python programs that can interact with Telegram. Think of it as a wrapper that has already done the heavy lifting for you, so you can focus on developing an application.

Let’s install telethon and pandas using pip:

pip install telethon
pip install pandas

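Let’s create a requirements.txt file. The command below writes the installed packages to it; the pinned versions that follow are the file's contents:

pip freeze > requirements.txt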
numpy==1.23.1
pandas==1.4.3
pyaes==1.6.1
pyasn1==0.4.8
python-dateutil==2.8.2
pytz==2022.1
rsa==4.9
six==1.16.0
Telethon==1.24.0

After installing the telethon and pandas libraries, here is the complete code to get user information from a group and save it to a CSV file. Note: you have to paste in your own api_id, api_hash, phone, and username.

import configparser
import json
import asyncio
import pandas as pd

from telethon import TelegramClient
from telethon.errors import SessionPasswordNeededError
from telethon.tl.functions.channels import GetParticipantsRequest
from telethon.tl.types import ChannelParticipantsSearch
from telethon.tl.types import (
    PeerChannel
)

# Setting configuration values (elided placeholders; replace them with your own)
api_id = 3..1
api_hash = '30..e3'
api_hash = str(api_hash)
phone = '+your_phone'
username = "your_telegram_user_name"

# Create the client and connect
client = TelegramClient(username, api_id, api_hash)

async def main(phone):
    await client.start()
    print("Client Created")
    # Ensure you're authorized
    if not await client.is_user_authorized():
        await client.send_code_request(phone)
        try:
            await client.sign_in(phone, input('Enter the code: '))
        except SessionPasswordNeededError:
            await client.sign_in(password=input('Password: '))

    me = await client.get_me()

    user_input_channel = input("enter entity(telegram URL or entity id):")

    if user_input_channel.isdigit():
        entity = PeerChannel(int(user_input_channel))
    else:
        entity = user_input_channel

    my_channel = await client.get_entity(entity)

    offset = 0
    limit = 100
    all_participants = []

    while True:
        participants = await client(GetParticipantsRequest(
            my_channel, ChannelParticipantsSearch(''), offset, limit,
            hash=0
        ))
        if not participants.users:
            break
        all_participants.extend(participants.users)
        offset += len(participants.users)

    all_user_details = []
    for participant in all_participants:
        all_user_details.append(
            {"user.id": participant.id, 
            "first_name": participant.first_name, 
            "last_name": participant.last_name,
            "username": participant.username, 
            "phone": "a_"+str(participant.phone),
            "user.access_hash": participant.access_hash,
            "is_bot": participant.bot})

    # Optional: also save the records as JSON
    # with open('user_data.json', 'w') as outfile:
    #     json.dump(all_user_details, outfile)

    # Build a file name from the input by replacing "/" and dropping ":"
    filename = user_input_channel.replace("/", "_").replace(":", "")
    df = pd.DataFrame(all_user_details)
    df.to_csv('withphone_' + filename + '.csv', index=True)  # writes the column names and the row index

with client:
    client.loop.run_until_complete(main(phone))

# Example inputs:
# t.me/suleymanARSLANTURK
# https://t.me/elasticELK

When you run the code, paste the group URL or entity id when prompted. The script then writes every member's information to a CSV file whose name is derived from your input, with slashes replaced by underscores and colons removed. The picture below shows the result:
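
The member-collection loop in the script uses offset-based paging: each request asks for up to limit users starting at offset, and the loop stops at the first empty page. The sketch below shows only that control flow. It does not contact Telegram; fetch_page and FAKE_MEMBERS are hypothetical stand-ins for the GetParticipantsRequest call and its results.

```python
# Stand-in data: 250 fake members. A real run would query Telegram instead.
FAKE_MEMBERS = [f"user_{i}" for i in range(250)]


def fetch_page(offset, limit):
    """Hypothetical stand-in for one GetParticipantsRequest call."""
    return FAKE_MEMBERS[offset:offset + limit]


def fetch_all(limit=100):
    """Collect every page, advancing the offset by the size of each page."""
    offset = 0
    collected = []
    while True:
        page = fetch_page(offset, limit)
        if not page:  # an empty page means there is nothing left
            break
        collected.extend(page)
        offset += len(page)
    return collected


members = fetch_all()
print(len(members))  # 250
```

Advancing by len(page) rather than by limit keeps the loop correct even when a page comes back shorter than requested.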


How to Scrape Telegram Group or Channel Comments

In this section, we will scrape messages from Telegram groups or channels. The only package the scraper itself needs is telethon; the first example below also uses pandas, installed in the previous section, to write a CSV file.

Let’s install telethon using pip:

pip install telethon

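Let’s create a requirements.txt file. The command below writes the installed packages to it; the pinned versions that follow are the file's contents:

pip freeze > requirements.txt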
pyaes==1.6.1
pyasn1==0.4.8
rsa==4.9
Telethon==1.24.0

Here is the simplest code:

from telethon.sync import TelegramClient
import pandas as pd

# Elided placeholders; replace them with your own values
api_id = 3..1
api_hash = '30..e3'
phone = '+9..6'
username = "H..k"

data = []
with TelegramClient(username, api_id, api_hash) as client:
    for message in client.iter_messages("https://t.me/clickhouse_en"):
        print(message.sender_id, ':', message.text, message.date)
        data.append([message.sender_id, message.text, message.date, message.id, message.post_author, message.views, message.peer_id.channel_id])

# Create a DataFrame with one row per message and save it as CSV
df = pd.DataFrame(data, columns=["message.sender_id", "message.text", "message.date", "message.id", "message.post_author", "message.views", "message.peer_id.channel_id"])
df.to_csv('filename.csv', encoding='utf-8')
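
The script above reads message.peer_id.channel_id, which works for channels such as the one in the example. As far as I know, Telethon reports basic groups with a chat_id and private chats with a user_id instead, so a more defensive row builder could look like the sketch below. The message_to_row helper is a hypothetical name, and the SimpleNamespace objects are stand-ins that mimic only the attributes used here, not real Telethon messages.

```python
from types import SimpleNamespace


def message_to_row(message):
    """Build one CSV row; the chat id falls back across the peer types."""
    peer = message.peer_id
    chat_id = (
        getattr(peer, "channel_id", None)
        or getattr(peer, "chat_id", None)
        or getattr(peer, "user_id", None)
    )
    return [message.sender_id, message.text, message.date, message.id, chat_id]


# Stand-in objects that mimic only the attributes used above.
channel_msg = SimpleNamespace(
    sender_id=1, text="hello", date="2022-07-01", id=10,
    peer_id=SimpleNamespace(channel_id=111),
)
group_msg = SimpleNamespace(
    sender_id=2, text="hi", date="2022-07-02", id=11,
    peer_id=SimpleNamespace(chat_id=222),
)

print(message_to_row(channel_msg))
print(message_to_row(group_msg))
```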

Scraped Messages:


Here is the code block to scrape all comments from a Telegram group or channel. Note: you have to paste in your own api_id, api_hash, phone, and username.

import configparser
import json
import asyncio
from datetime import date, datetime

from telethon import TelegramClient
from telethon.errors import SessionPasswordNeededError
from telethon.tl.functions.messages import (GetHistoryRequest)
from telethon.tl.types import (
    PeerChannel
)


# Custom encoder so that datetime and bytes values can be written to JSON
class DateTimeEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, datetime):
            return o.isoformat()

        if isinstance(o, bytes):
            return list(o)

        return json.JSONEncoder.default(self, o)


# Reading Configs
config = configparser.ConfigParser()
config.read("config.ini")

# Setting configuration values (elided placeholders; the comments show the config.ini alternative)
api_id = 32...31  # config['Telegram']['api_id']
api_hash = '309...823e3'  # config['Telegram']['api_hash']

api_hash = str(api_hash)

phone = '+90..76'  # config['Telegram']['phone']
username = "Heyhak"  # config['Telegram']['username']

# Create the client and connect
client = TelegramClient(phone, api_id, api_hash)

async def main(phone):
    await client.start()
    print("Client Created")
    # Ensure you're authorized
    if not await client.is_user_authorized():
        await client.send_code_request(phone)
        try:
            await client.sign_in(phone, input('Enter the code: '))
        except SessionPasswordNeededError:
            await client.sign_in(password=input('Password: '))

    me = await client.get_me()

    user_input_channel = input('enter entity(telegram URL or entity id):')

    if user_input_channel.isdigit():
        entity = PeerChannel(int(user_input_channel))
    else:
        entity = user_input_channel

    my_channel = await client.get_entity(entity)

    offset_id = 0
    limit = 100
    all_messages = []
    total_messages = 0
    total_count_limit = 0

    while True:
        print("Current Offset ID is:", offset_id, "; Total Messages:", total_messages)
        history = await client(GetHistoryRequest(
            peer=my_channel,
            offset_id=offset_id,
            offset_date=None,
            add_offset=0,
            limit=limit,
            max_id=0,
            min_id=0,
            hash=0
        ))
        if not history.messages:
            break
        messages = history.messages
        for message in messages:
            all_messages.append(message.to_dict())
        offset_id = messages[-1].id  # continue from the oldest message in this batch
        total_messages = len(all_messages)
        if total_count_limit != 0 and total_messages >= total_count_limit:
            break

    with open('channel_messages.json', 'w') as outfile:
        json.dump(all_messages, outfile, cls=DateTimeEncoder)

with client:
    client.loop.run_until_complete(main(phone))

When you run the code, paste the group or channel URL or entity id when prompted.

After the script finishes, all messages/comments from the group or channel are saved in a JSON file named channel_messages.json, as shown in the picture below.
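
To work with the saved data afterwards, keep in mind that the encoder turns dates into ISO 8601 strings and bytes into lists of integers. The sketch below repeats the encoder idea from the script on a made-up record (the field names are illustrative and only loosely shaped like a message.to_dict() entry) and shows how a date can be parsed back after loading.

```python
import json
from datetime import datetime, timezone


class DateTimeEncoder(json.JSONEncoder):
    """Same idea as the encoder in the script: make dates and bytes JSON-safe."""

    def default(self, o):
        if isinstance(o, datetime):
            return o.isoformat()
        if isinstance(o, bytes):
            return list(o)
        return super().default(o)


# A made-up record; real entries contain many more fields.
sample = [{
    "id": 1,
    "message": "hello",
    "date": datetime(2022, 7, 1, 12, 0, tzinfo=timezone.utc),
    "raw": b"\x01\x02",
}]

text = json.dumps(sample, cls=DateTimeEncoder)
loaded = json.loads(text)

# Dates come back as ISO 8601 strings, so parse them again if needed.
parsed_date = datetime.fromisoformat(loaded[0]["date"])
print(loaded[0]["raw"], parsed_date.year)
```

With the real output you would read the data using json.load on channel_messages.json instead of json.loads on an in-memory string.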

Messages in a JSON file:


Conclusion

Let’s recap what we have learned in this tutorial. We started with how to get an api_id and api_hash from Telegram, then installed the telethon package. Finally, we scraped the members of a Telegram group and the messages of a Telegram group or channel.



