With Python, fetch in Excel file the value of a cell in a row where a cell in the same row contains the character string from an xml file

Jackie · November 22, 2024, 11:20am

I have several xml files with names like :

LLL_ABC0D012_title.xml

LLL_ABC0D013_title.xml

LLL_ABT0G012_title.xml

LLL_ABR0N012_title.xml

There are always 8 characters after the ‘LLL_’.

I have an Excel file with over 900 lines that looks like this :

reference	dtp1	dtp2	dtp3
ABC0D012	1_blabla	1_1_blablabla	1_1_1_blablablabla
ABC0D013	1_blabla	1_1_blablabla	1_1_1_blablablabla
ABC0D014	1_blabla	1_1_blablabla	1_1_1_blablablabla
ABT0G012	1_blabla	1_1_blablabla	1_1_1_blablablabla

In my xml files, I have text, text, text tags. I would like to modify the text of these tags with the text in the cells of the corresponding columns in the table above.

To do this, I’d like my script to do a loop that reads ABC0D012 characters (and then the other) in my xml file title, find the match in the ‘reference’ column of my Excel file and then look for the corresponding value in ‘dtp1’, ‘dtp2’ and ‘dtp3’. I would then like to save these values in a variable to replace the text in my tags.

I’m a beginner in Python. I tried something like :

import numpy as np
import openpyxl
import pandas as pd
import xml.etree.ElementTree as ET
import os

table1 = pd.read_excel('C:/Users/Documents/datatypes.xlsx', na_values=['NA'])
table2 = table1.replace('\xa0', ' ',regex=True)

for root, dirs, files in os.walk("."):
    for file in files :
        if file[-4:] == '.xml':
            #print(file)
            xml = ET.parse('LLL_ABC0D012_title.xml')
            root = xml.getroot()
            dtp1_xml = root[8]
            dtp2_xml = root[9]          
            dtp3_xml = root[10]
            num = file[4:12]
            #print(num)
            dtp1_excel = table2['dtp1'].where(table2['reference'] == num)
            dtp1_xml.text = dtp1_excel
            #print(dtp1_xml.text)
            ET.indent(root)
            xml.write("LLL_ABC0D012_title.xml", encoding='utf-8', xml_declaration=True, method='xml')

But it doesn’t work, ‘dtp1_excel’ does not return the dtp1 value from the table in print function but something like :

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
      ...
931    NaN
932    NaN
933    NaN
934    NaN
935    NaN
Name: dtp1, Length: 936, dtype: object

I have no NaN value in the reference and dtp1 columns and it should only return a single cell value.

What’s wrong with my approach ? Could you help me, please ?

Finn · November 22, 2024, 11:29am

The issue lies in how you’re trying to retrieve the value from the Excel data. The .where() function doesn’t filter the DataFrame in place but rather returns a Series with NaN for non-matching rows. To extract the matching value, you should use a more direct approach like .loc.

Here’s an updated version of your script with the fixes applied:

Updated Script

import pandas as pd
import xml.etree.ElementTree as ET
import os

# Load the Excel file into a DataFrame
table1 = pd.read_excel('C:/Users/Documents/datatypes.xlsx', na_values=['NA'])
table2 = table1.replace('\xa0', ' ', regex=True)

# Walk through the current directory and find XML files
for root, dirs, files in os.walk("."):
    for file in files:
        if file.endswith('.xml'):
            # Parse the XML file
            xml = ET.parse(file)
            root = xml.getroot()

            # Extract the reference number from the file name
            reference_num = file[4:12]  # Extract characters after 'LLL_' (8 characters)

            # Find the row in the DataFrame that matches the reference number
            row = table2.loc[table2['reference'] == reference_num]
            if not row.empty:
                # Extract corresponding dtp1, dtp2, dtp3 values
                dtp1_excel = row['dtp1'].iloc[0]
                dtp2_excel = row['dtp2'].iloc[0]
                dtp3_excel = row['dtp3'].iloc[0]

                # Update the XML tags
                dtp1_xml = root.find("./dtp1")  # Replace 'dtp1' with the actual tag name
                dtp2_xml = root.find("./dtp2")  # Replace 'dtp2' with the actual tag name
                dtp3_xml = root.find("./dtp3")  # Replace 'dtp3' with the actual tag name

                if dtp1_xml is not None:
                    dtp1_xml.text = dtp1_excel
                if dtp2_xml is not None:
                    dtp2_xml.text = dtp2_excel
                if dtp3_xml is not None:
                    dtp3_xml.text = dtp3_excel

                # Save the updated XML file
                ET.indent(root)
                xml.write(file, encoding='utf-8', xml_declaration=True)
            else:
                print(f"No match found for {reference_num} in the Excel file.")

Explanation of Changes:

Extracting Rows Using loc:

The table2.loc[table2['reference'] == reference_num] is used to filter rows directly by the reference column. If a match exists, it will return the row(s); otherwise, it will be empty.

Accessing Values Using .iloc[0]:

Since row is a DataFrame, we extract the first (and only) row with .iloc[0] and access the desired column values.

Finding XML Tags:

Used root.find("./tag_name") to locate XML tags (dtp1, dtp2, dtp3). Replace tag_name with the actual XML tag names.

File Name Handling:

The script dynamically processes all XML files in the directory.

Error Handling:

If a reference is not found, it skips the file and prints a message.

Notes:

Make sure the dtp1, dtp2, and dtp3 tags exist in your XML files and their names match the ones used in root.find("./tag_name").
Adjust the file path for your Excel file and XML files as needed.
Test the script with a few files before running it on all files to ensure it behaves as expected.

Let me know if you need further assistance!