
Historical realtime data for a specific route

Hello Trafiklab team,

I would like to plot the history (going back 2-3 years) of scheduled vs. real departures for a few (around 5) selected routes.

After reading the description I decided to use the KoDa database, as it's the one that contains historical realtime data, while GTFS only has the data as of the moment it's queried.
Example:
url = f"https://api.koda.trafiklab.se/KoDa/api/v2/gtfs-static/{COMPANY}?date={DATE}&key={KEY}"

I used KoDa static to get route_id, trip_id, stop_id and the associated planned arrival/departure times.
To get the real departure times, I used the KoDa realtime feed:
url = f"https://api.koda.trafiklab.se/KoDa/api/v2/gtfs-rt/{OPERATOR}/{FEED}?date={DATE}&key={KEY}&hour={TIME}", where TIME = [0-23] and FEED = "TripUpdates"
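The two URL patterns above can be sketched as small helper functions. This is a minimal sketch: the endpoints are the ones quoted in this post, the function names are mine, and note that the stray spaces around the `&` in the original f-string would corrupt the query string.

```python
# Sketch of building the two KoDa endpoint URLs used above.
# The endpoints are those quoted in this post; the function names
# and placeholder values are illustrative only.

def static_url(operator: str, date: str, key: str) -> str:
    """GTFS static archive for one operator and day (YYYY-MM-DD)."""
    return (f"https://api.koda.trafiklab.se/KoDa/api/v2/"
            f"gtfs-static/{operator}?date={date}&key={key}")

def realtime_url(operator: str, date: str, key: str, hour: int,
                 feed: str = "TripUpdates") -> str:
    """GTFS-RT archive for one operator, day and hour (0-23).
    Note: no spaces around '&' -- spaces would break the query string."""
    return (f"https://api.koda.trafiklab.se/KoDa/api/v2/"
            f"gtfs-rt/{operator}/{feed}?date={date}&key={key}&hour={hour}")

print(realtime_url("skane", "2025-03-01", "MY_KEY", 7))
```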

Even for a single operator and day, this query results in very large files once unzipped, since they contain multiple updates per hour for all stations and routes.

I tried adding &route_id={ROUTE_ID} to the URL, but it didn't seem to work.

My question is:

0) Is KoDa the optimal dataset to achieve my goal?
1) Is it possible to limit the queried data to only specific stop and trip ids? It didn't seem to work for me.
2) Is it possible to query just the last update for each hour (as it should be the closest to the real departure) instead of all the available data?

Thank you for your attention.




Mat

Comments

  • Hi Mat,

KoDa is the right (and only) source for this kind of data. KoDa archives all realtime data; for delay information this means a snapshot is taken every 15 seconds, so you can see exactly how much delay a trip had at a given time.

You need to download the complete dataset for a given day; then you can process and filter the data you are interested in for your use case. In your case you could choose to read only one realtime file for each hour, but since passed stops are not kept in the realtime data for more than 10 minutes, you risk not having the actual passing time at all stops.

    Regards,
Bert at Trafiklab
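The "download everything, then filter" step Bert describes could look like the sketch below. It assumes the day's GTFS-RT TripUpdates have already been parsed into plain dicts with `trip_id`, `route_id` and `timestamp` keys (e.g. via the gtfs-realtime-bindings package); the function name and dict layout are mine, not part of the KoDa API.

```python
# Hypothetical filtering step: keep only updates for the selected
# routes, and for each trip keep the update with the highest feed
# timestamp (i.e. the last known state of that trip).
# Assumes updates are already parsed into dicts -- illustrative only.

def latest_per_trip(updates, wanted_routes):
    """Return {trip_id: update} with only the newest update per trip."""
    best = {}
    for u in updates:
        if u["route_id"] not in wanted_routes:
            continue
        t = u["trip_id"]
        if t not in best or u["timestamp"] > best[t]["timestamp"]:
            best[t] = u
    return best

updates = [
    {"trip_id": "t1", "route_id": "r1", "timestamp": 100},
    {"trip_id": "t1", "route_id": "r1", "timestamp": 160},
    {"trip_id": "t2", "route_id": "r9", "timestamp": 120},
]
print(latest_per_trip(updates, {"r1"}))  # only t1, with timestamp 160
```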
  • Hello Bert,

    Thanks for the information. 
As mentioned, my aim is to get some delay statistics for a few route_ids over the course of the past 1-2 years.
When I tried to download the regional realtime traffic data from the KoDa gtfs-rt feed for a certain day, the query took an hour before the file was ready for download: do I need to run a query for each day, or can I query multiple days at the same time?

    Best regards

    Mat
  • Hi Mat,

Downloading multiple files at the same time can lead to files getting stuck when the server becomes overloaded. We recommend downloading only one file at a time. Many of the daily archives have already been created and may therefore download faster than hourly archives.

    Regards,
Bert
  • Hello Bert,

    Thanks for the quick and accurate reply.
If I understand correctly, I need to download the zipped files day by day, but some of them have already been processed by someone else, so the download is faster because the file already exists.

From my earlier testing using resp = requests.head(url):
- If the data is not processed: I get status code 202, which then becomes 200 once the file is ready for download
- If the data is already processed: I get status code 200 directly

    I have a follow up question:
    I wanted to use a HEAD request to check which files have been already created (code 200) and which files need to be created (code 202).
    Does the HEAD request itself start the job on your server (meaning that I cannot send multiple HEAD requests for different days in a rapid succession) or does it just give info about data availability (therefore I can quickly scan how many days have been already processed in the past years)?

    Thank you again.
    Mat
The HEAD request will start the job, but you can use it in a script to keep polling with HEAD requests, then download the file once you receive an HTTP 200 response code. Which operator and feed type are you interested in?

    Regards,
Bert
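The polling loop Bert suggests could be sketched as below. The `head` argument is injectable so the function can be exercised without network access; in practice you would pass `requests.head`. The function name and the retry limits are my own choices, not part of the KoDa API.

```python
import time

# Sketch of the HEAD-polling loop: the first HEAD request starts the
# archive job (HTTP 202), later ones report progress until the file
# exists (HTTP 200). Poll interval and retry cap are illustrative.

def wait_until_ready(url, head, poll_seconds=30, max_tries=240):
    """Poll with HEAD requests until the archive returns HTTP 200."""
    for _ in range(max_tries):
        status = head(url).status_code
        if status == 200:
            return True
        if status != 202:          # unexpected error: stop polling
            raise RuntimeError(f"HTTP {status} for {url}")
        time.sleep(poll_seconds)
    return False
```

Usage would be roughly `if wait_until_ready(url, requests.head): data = requests.get(url).content`.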
  • Hello Bert,

Ah I see... so I need to:
-> loop over the days from 2024 to today
-> create the request URL for day X
-> send a HEAD request
   if the status code is 200: proceed
   if it is 202: periodically send HEAD requests until the code becomes 200
-> download the zip file

repeat for all days
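The outer loop of the steps above could be sketched like this. It assumes, based on Bert's mention of daily archives, that the daily archive is the same gtfs-rt endpoint without the hour parameter; the function name and placeholder values are mine.

```python
from datetime import date, timedelta

# Sketch of the day-by-day loop: generate one daily gtfs-rt archive
# URL per date in the range. Assumes (unverified) that omitting the
# hour parameter requests the daily archive. OPERATOR/FEED/KEY are
# the placeholders from earlier in the thread.

def daily_urls(start, end, operator="skane", feed="TripUpdates", key="KEY"):
    urls = []
    d = start
    while d <= end:
        urls.append(
            f"https://api.koda.trafiklab.se/KoDa/api/v2/"
            f"gtfs-rt/{operator}/{feed}?date={d.isoformat()}&key={key}"
        )
        d += timedelta(days=1)
    return urls

for url in daily_urls(date(2024, 1, 1), date(2024, 1, 3)):
    # here: HEAD-poll until HTTP 200, then download and save the zip
    print(url)
```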
      
    "Which operator and feed type are you interested in?"

I'm looking at Öresundståg historical data, so I'm using skane and TripUpdates.
As per your earlier feedback, it is not possible to filter the data further than this.

    Best regards
    Mat
  • Hi Mat,

    For 2024, 360 files have already been generated. For 2025, 261 files have been generated already, and for 2026, 46 have been generated.

    Most of the downloads should go quite fast.

    Regards,
Bert
  • Hello Bert,

Oh wow, I guess I got unlucky when I was testing it for some days in February 2026, as most of the days were returning a 202 status code.

    I will finish my script and let you know how it goes.

    Thank you very much for the support on this.

    Best regards,
    Mat
