pd.read_sql and pd.read_sql_query hang upon execution, how to troubleshoot this?

fonehe5161 · Feb 22, 2023

I am currently trying to read from a local MySQL database table on my computer. This table has about 2M rows of data. So when I call read_sql() or read_sql_query() I tried passing it chunksize = 1000, chunksize=10 and then chunksize=1, all of which in my debug session in PyCharm, it just disappears and hangs. I have let it run for some 10 minutes and then killed the execution.

I have no idea what is going on or how to troubleshoot this? Any suggestions? Here is my code:

` try: # Connect to local database database_uri = 'mysql+pymysql://root:1234@localhost:3306' localEngine = sqlalchemy.create_engine(database_uri)

Code:

    with localEngine.connect() as conn_local:

        result = conn_local.execute(text("USE ConsumerExpenditures10;"))

        result = conn_local.execute(text("""
                                    CREATE TABLE IF NOT EXISTS LOADING_TABLE (
                                        EXPENDITURE_ID varchar(11) PRIMARY KEY NOT NULL,
                                        HOUSEHOLD_ID VARCHAR(10) NOT NULL,
                                        YEAR YEAR NOT NULL,
                                        MONTH INT(11) NOT NULL,
                                        PRODUCT_CODE VARCHAR(155) NOT NULL,
                                        COST DOUBLE NOT NULL,
                                        GIFT INT NOT NULL,
                                        IS_TRAINING INT(255) NOT NULL,
                                        MARITAL VARCHAR(25),
                                        SEX VARCHAR(25),
                                        AGE INT,
                                        WORK_STATUS VARCHAR(25),
                                        INCOME_RANK double,
                                        INCOME_RANK_1 double,
                                        INCOME_RANK_2 double,
                                        INCOME_RANK_3 double,
                                        INCOME_RANK_4 double,
                                        INCOME_RANK_5 double,
                                        INCOME_RANK_MEAN double,
                                        FEDERAL_FUNDS_TARGET_RATE double,
                                        FEDERAL_FUNDS_UPPER_TARGET double,
                                        FEDERAL_FUNDS_LOWER_TARGET double,
                                        EFFECTIVE_FEDERAL_FUNDS_RATE double,
                                        REAL_GDP double,
                                        UNEMPLOYMENT_RATE double,
                                        INFLATION_RATE double,
                                        CPI double)
                                    """))

        # TRANSFORMATION #9 - Create comprehensive table of GDP, CPI and Consumer Expenditures Data
        # with each row being one Consumer Expenditure Purchase
        query = text("""select e.expenditure_id, e.household_id, e.year, e.month, e.product_code,
                        e.cost, e.gift, e.is_training,
                        hm.marital, hm.sex, hm.age, hm.work_status, h.income_rank,
                        h.income_rank_1, h.income_rank_2, h.income_rank_3, h.income_rank_4,
                        h.income_rank_5, h.income_rank_mean, g.FEDERAL_FUNDS_TARGET_RATE,
                        g.FEDERAL_FUNDS_UPPER_TARGET, g.FEDERAL_FUNDS_LOWER_TARGET,
                        g.EFFECTIVE_FEDERAL_FUNDS_RATE,
                        g.REAL_GDP, g.UNEMPLOYMENT_RATE, g.INFLATION_RATE, c.CPI
                        from expenditures e
                        inner join household_members hm
                        on hm.household_id = e.HOUSEHOLD_ID
                        inner join households h
                        on h.household_id = hm.HOUSEHOLD_ID
                        inner join gdp g
                        on g.gdp_year = e.`YEAR`
                        inner join cpi c
                        on c.CPI_YEAR = g.gdp_year""")

        # https://pythonspeed.com/articles/pandas-sql-chunking/
        # https://stackoverflow.com/questions/69711599/pandas-read-sql-from-ms-sql-gets-stuck-for-queries-with-275-chars-in-linux
        # Takes too long to execute this query: df_final_table = pd.read_sql(query, conn_local)
        # So we have to do it in chunks to load into a pandas dataframe and then write that to the loading_table
        for chunk_dataframe in pd.read_sql_query(query, conn_local, chunksize=10):
            print(
                f"Got dataframe w/{len(chunk_dataframe)} rows"
            )

            # write this dataframe chunk into the LOADING_TABLE
            result = chunk_dataframe.to_sql(name='LOADING_TABLE', con=conn_local, if_exists='append', index=False)

    
        conn_local.commit()`

Do you have any suggestions on what I can do trouble shoot this?

tagey73068 · Feb 22, 2023

Are these queries running as part of a greater execution process? Maybe you are getting a deadlock somewhere. Also due to an error happening initially, maybe Python exited before closing its connections fully and now your connections to the DB are full. If all of this fails, try running these queries in mysql manually and compare results

Welcome!

pd.read_sql and pd.read_sql_query hang upon execution, how to troubleshoot this?

fonehe5161

New member

tagey73068

New member